Md syntax compare #356

rousik · 2020-04-20T21:37:12Z

This script compares the structural content (element tree) of markdown files. It pairs each English (original) file with its translated counterparts, generates html for both, and produces a diff for every structural difference found.

Markdown files are transformed into html and parsed into an element tree (using lxml). Each document is then traversed and every element is turned into a string representation of the form /path/to/tag#attr1=value1#attr2=value2. The resulting lists of strings are then compared and diffed for each (english, translation) pair.
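To make the string representation concrete, here is a minimal sketch of the traversal and diff step. It is not the script from this PR: it assumes the html has already been rendered, it swaps lxml for the standard library's xml.etree.ElementTree so it runs without dependencies, and the names tag_strings and structural_diff are hypothetical. It also assumes the "#" separator is always emitted, even when an element has no attributes.

```python
import difflib
import xml.etree.ElementTree as ET


def tag_strings(element, prefix=""):
    """Yield one '/path/to/tag#attr=value' string per element, depth first."""
    path = f"{prefix}/{element.tag}"
    # Assumption: alt= is left out because its text is translated and
    # would therefore always differ between original and translation.
    attrs = "#".join(
        f"{key}={value}"
        for key, value in sorted(element.attrib.items())
        if key != "alt"
    )
    yield f"{path}#{attrs}"
    for child in element:
        yield from tag_strings(child, path)


def structural_diff(original_html, translated_html):
    """Return unified-diff lines between the two tag-string lists."""
    original = list(tag_strings(ET.fromstring(original_html)))
    translated = list(tag_strings(ET.fromstring(translated_html)))
    return list(
        difflib.unified_diff(original, translated, "en", "translation", lineterm="")
    )


# Hypothetical rendered fragments: the translation uses h3 where the original has h2.
english = "<div><h2 id='intro'>Intro</h2><p>Text <img src='a.png' alt='A cat'/></p></div>"
german = "<div><h3 id='intro'>Einleitung</h3><p>Text <img src='a.png' alt='Eine Katze'/></p></div>"

for line in structural_diff(english, german):
    print(line)
```

Only the heading level shows up in the diff here; the differing text and alt= values do not, because text content is never part of the string representation and alt= is skipped.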

Flags for this script are:

--langs [de,fr,...], restrict the analysis to the given translations
--include-regex, only analyze tags whose string representation matches the given regex (e.g. "/h.#" to compare headers)
--exclude-regex, skip tags whose string representation matches the given regex (e.g. "/p#" to skip over paragraphs)
--show-tag-summaries, summarize how often each tag differed between the original and the translation (this can be used to find out which tags we should focus on)
--hide-diffs, hide the per-file html tag diffs
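The filtering and summary flags can be illustrated with a short sketch. This is an assumption about how such filtering could work, not the script's actual code: filter_tags and tag_summary are hypothetical names, and the sketch assumes the regexes are applied with re.search (a match anywhere in the string) rather than anchored at the start.

```python
import re
from collections import Counter


def filter_tags(tags, include_regex=None, exclude_regex=None):
    """Keep tag strings that match include_regex and do not match exclude_regex."""
    include = re.compile(include_regex) if include_regex else None
    exclude = re.compile(exclude_regex) if exclude_regex else None
    kept = []
    for tag in tags:
        if include and not include.search(tag):
            continue
        if exclude and exclude.search(tag):
            continue
        kept.append(tag)
    return kept


def tag_summary(diff_lines):
    """Count changed diff lines per tag name (last component of the path)."""
    counts = Counter()
    for line in diff_lines:
        # Skip the '---'/'+++' file headers, hunk markers and context lines.
        if line.startswith(("+++", "---")) or not line.startswith(("+", "-")):
            continue
        path = line[1:].split("#", 1)[0]
        counts[path.rsplit("/", 1)[-1]] += 1
    return counts


tags = ["/div#", "/div/h2#id=intro", "/div/p#", "/div/hr#"]
print(filter_tags(tags, include_regex="/h.#"))
print(filter_tags(tags, exclude_regex="/p#"))
```

Under these assumptions, a pattern such as "/h.#" keeps h1 to h6 but also hr, since "." matches any single character; "/h[1-6]#" would be stricter.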

This script can be used to find issues with markdown styling in translations and to detect files where the style has drifted between the original content and its translations. The underlying issue is described in #354.

This script finds all md files, associates each English original with its translations, analyzes the html tag structure, and detects when a translation differs from the English original. Use --lang de,fr,... to restrict the operation to a specific subset of languages. Use --include-regex to scan only for matching tags. Use --exclude-regex to remove tags that are often irrelevant (e.g. /p#). Both regexes are applied to the string representation of the html tag, which uses the following schema: /enclosing/tag/names#attr1=val1#attr2=val2#... To match a specific tag, use the regex /tagname#.

1. sort imports
2. eliminate spurious newlines from diffs
3. return non-zero if diffs found
4. only report on languages where diffs are found

1. added --show-tag-summaries. Shows summaries of tags that diverge.
2. Exclude alt= attribute when building string representation. This is bound to always be different.

github-actions · 2020-04-20T21:38:30Z

This pull request is being automatically deployed with now-deployment

Built with commit 3c9d259

✅ Preview: https://guide-preview-2ftn8934r.now.sh

matiasgarciaisaia

I guess the script is OK. Not really good at Python, but I didn't see anything nasty 👍.

I'd like to have some instructions on how to install the dependencies to be able to run this. I guess that'll come with the PR that actually runs this?

rousik · 2020-04-23T14:12:34Z

Noted. I think it makes sense to document how to use this script on our wiki page (along with instructions on how to install the dependencies). If it gets built into an action, the dependency installation will come with that too.

rousik · 2020-04-23T14:14:00Z

Seems like using docker is the standard way to wrap things in this project so I will investigate if I could use that.

rousik added 6 commits April 20, 2020 09:25

Use non-deprecated logger.warning methods.

385fcfc

Skip over md files in build/ directory

1c35975

Few minor stylistical changes.

9ac1dd1

1. sort imports 2. eliminate spurious newlines from diffs 3. return non-zero if diffs found 4. only report on languages where diffs are found

Few operational improvements to the script.

b547687

1. added --show-tag-summaries. Shows summaries of tags that diverge. 2. Exclude alt= attribute when building string representation. This is bound to always be different.

Add --hide-diff option and remove outdated TODO.

3c9d259

rousik assigned nditada, mverzilli and matiasgarciaisaia and unassigned nditada Apr 20, 2020

matiasgarciaisaia approved these changes Apr 22, 2020

rousik merged commit 99665e5 into master Apr 23, 2020
