Md syntax compare #356

Merged
merged 6 commits into from
Apr 23, 2020
rousik
Collaborator

@rousik rousik commented Apr 20, 2020

This script compares the structural content (element tree) of the markdown files. It pairs each English (original) file with its translated markdown files, renders them to HTML, and produces diffs for any structural differences found.

Markdown files are transformed into HTML and parsed into an element tree (using lxml). Each document is then traversed and its elements are turned into string representations of the form /path/to/tag#attr1=value1#attr2=value2. The lists of these strings are then diffed for each (English, translation) pair.
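The traversal and diff described above can be sketched roughly as follows. This is an illustration, not the code from this PR: it uses the standard library's `xml.etree.ElementTree` in place of lxml, assumes the markdown has already been rendered to well-formed HTML, and the names `tag_signatures` and `structural_diff` are hypothetical. It also assumes a `#` always follows the tag path (so that a pattern like `/p#` can match attribute-less tags) and skips `alt`, as one of the commits in this PR mentions.

```python
import difflib
import xml.etree.ElementTree as ET


def tag_signatures(html, skip_attrs=("alt",)):
    """Return one '/path/to/tag#attr=value' string per element, in document order."""
    # Wrap the fragment so it has a single root; assumes well-formed markup.
    root = ET.fromstring(f"<root>{html}</root>")
    signatures = []

    def walk(element, path):
        for child in element:
            child_path = f"{path}/{child.tag}"
            attrs = "#".join(
                f"{name}={value}"
                for name, value in sorted(child.attrib.items())
                if name not in skip_attrs  # translated text such as alt= always differs
            )
            # Assumption: a '#' follows the path even when there are no attributes.
            signatures.append(f"{child_path}#{attrs}")
            walk(child, child_path)

    walk(root, "")
    return signatures


def structural_diff(original_html, translated_html):
    """Unified diff of the two signature lists; empty when the structure matches."""
    return list(
        difflib.unified_diff(
            tag_signatures(original_html),
            tag_signatures(translated_html),
            fromfile="en",
            tofile="translation",
            lineterm="",
        )
    )
```

Because only tag paths and attributes are compared, translated text inside the elements does not show up in the diff; only a changed heading level, a missing list, or a differing attribute does.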

Flags for this script are:

  1. --langs [de,fr,...]: restrict the analysis to the given translations.
  2. --include-regex: only analyze tags whose string representation matches the given regex (e.g. "/h.#" to compare headers).
  3. --exclude-regex: skip tags whose string representation matches the given regex (e.g. "/p#" to skip paragraphs).
  4. --show-tag-summaries: summarize how often each tag differed between the original and the translation (useful for finding out which tags we should focus on).
  5. --hide-diffs: hide the per-file HTML tag diffs.
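A minimal sketch of how these flags might be declared and how the two regex filters could be applied. The flag names come from the list above; the defaults, help strings, the use of `re.search` (rather than a full match), and the function names `build_parser` and `filter_signatures` are assumptions made for illustration.

```python
import argparse
import re


def build_parser():
    """Declare the flags listed above; defaults and help text are assumed."""
    parser = argparse.ArgumentParser(
        description="Compare markdown structure between English and translations."
    )
    parser.add_argument("--langs", default="",
                        help="comma-separated translations to analyze, e.g. de,fr")
    parser.add_argument("--include-regex", default=None,
                        help="only analyze tag strings matching this regex")
    parser.add_argument("--exclude-regex", default=None,
                        help="skip tag strings matching this regex")
    parser.add_argument("--show-tag-summaries", action="store_true",
                        help="summarize how often each tag differed")
    parser.add_argument("--hide-diffs", action="store_true",
                        help="do not print per-file tag diffs")
    return parser


def filter_signatures(signatures, include_regex=None, exclude_regex=None):
    """Keep strings that match include_regex (if set) and not exclude_regex (if set)."""
    kept = []
    for signature in signatures:
        if include_regex and not re.search(include_regex, signature):
            continue
        if exclude_regex and re.search(exclude_regex, signature):
            continue
        kept.append(signature)
    return kept
```

With tag strings such as `/h1#` and `/p/img#src=a.png`, `--include-regex "/h.#"` keeps only the header entries, while `--exclude-regex "/p#"` drops paragraphs but keeps elements nested inside them.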

This script can be used to find markdown styling issues in translations and to detect files where the style has drifted between the original content and its translations. The underlying issue is described in #354.

rousik added 6 commits April 20, 2020 09:25
This script finds all md files, associates all
translations and analyzes the html tag structure
and can detect when translations differ from the
english original.

Use --lang de,fr,... to restrict the operation
to specific subset of languages only.

Use --include-regex to only scan for matching tags.

Use --exclude-regex to remove tags that are often
irrelevant (e.g. /p#).

The above regexes are applied to string representation
of the html tag which uses the following schema:

/enclosing/tag/names#attr1=val1#attr2=val2#...

To match a specific tag, use a regex of the form /tagname#.
1. sort imports
2. eliminate spurious newlines from diffs
3. return non-zero if diffs found
4. only report on languages where diffs are found
1. added --show-tag-summaries. Shows summaries of
tags that diverge.

2. Exclude alt= attribute when building string representation.
This is bound to always be different.
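The tag summaries and the non-zero exit status mentioned in the commits above could look roughly like this. It is a sketch under assumptions: the per-file diffs are taken to be lists of unified-diff lines, and `summarize_diff_tags` and `exit_code` are hypothetical names, not functions from the PR.

```python
from collections import Counter


def summarize_diff_tags(diffs_by_file):
    """Count how often each tag string appears on a changed (+/-) diff line."""
    counts = Counter()
    for diff_lines in diffs_by_file.values():
        for line in diff_lines:
            if line.startswith(("+++", "---", "@@")):
                continue  # unified-diff headers, not tag strings
            if line.startswith(("+", "-")):
                counts[line[1:]] += 1
    return counts


def exit_code(diffs_by_file):
    """Return 1 if any file has a non-empty diff, else 0."""
    return 1 if any(diffs_by_file.values()) else 0
```

Passing the result of `exit_code` to `sys.exit` would let a CI job fail whenever a translation has drifted structurally from the original.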
@github-actions
Contributor

This pull request is being automatically deployed with now-deployment

Built with commit 3c9d259

✅ Preview: https://guide-preview-2ftn8934r.now.sh

Collaborator

@matiasgarciaisaia matiasgarciaisaia left a comment

I guess the script is OK. Not really good at Python, but I didn't see anything nasty 👍.

I'd like to have some instructions on how to install the dependencies to be able to run this. I guess that'll come with the PR that actually runs this?

@rousik
Collaborator Author

rousik commented Apr 23, 2020

Noted. I think it makes sense to document how to use this script on our wiki page (along with instructions on how to install dependencies). If it gets built into an action, the dependency installation will come with that too.

@rousik
Collaborator Author

rousik commented Apr 23, 2020

Seems like using Docker is the standard way to wrap things in this project, so I will investigate whether I can use that.

@rousik rousik merged commit 99665e5 into master Apr 23, 2020
4 participants