validate-parlamint speedup #846

Open
matyaskopp opened this issue Feb 28, 2024 · 7 comments · May be fixed by #894

@matyaskopp
Collaborator

I have been exploring why the validation is so slow.

jing

jing can validate multiple files against the same schema in parallel. These are the timings on a 64-thread CPU, in seconds:

| # of files | loading schema | validating | total time |
|---|---|---|---|
| 1 | 0.335 | 0.751 | 1.086 |
| 112 | 0.314 | 25.774 | 26.088 |
| 297 | 0.338 | 59.574 | 59.912 |
| 1149 | 0.335 | 248.152 | 248.487 |

We can speed up jing 5 times, but the order of output will be different - not file by file.
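
The "5 times" figure can be roughly checked against the totals in the table. The sketch below is only an estimate: it assumes that validating files one by one costs about as much per file as the measured single-file run (schema loading included), which the thread does not measure directly. The function name `estimated_speedup` is made up for illustration.

```python
# Totals copied from the table above: (number of files, total seconds for one jing run).
measurements = [(1, 1.086), (112, 26.088), (297, 59.912), (1149, 248.487)]

# Assumption: a one-by-one run costs the single-file total for every file,
# because the schema is reloaded on each invocation.
single_file_total = measurements[0][1]


def estimated_speedup(n_files: int, batch_total: float) -> float:
    """Ratio of the estimated one-by-one time to the measured multi-file time."""
    return (n_files * single_file_total) / batch_total


for n_files, batch_total in measurements[1:]:
    print(f"{n_files:5d} files: ~{estimated_speedup(n_files, batch_total):.1f}x")
```

Under that assumption the three multi-file runs come out at roughly 4.7x, 5.4x and 5.0x.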

@matyaskopp matyaskopp added the enhancement New feature or request label Feb 28, 2024
@matyaskopp matyaskopp self-assigned this Feb 28, 2024
@matyaskopp
Collaborator Author

We can speed up jing 5 times, but the order of output will be different - not file by file.

@TomazErjavec do we insist on this order?
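
If the per-file order matters, one possible workaround is to regroup the messages after a multi-file run. The sketch below assumes every jing message starts with `<path>:<line>:<column>: `, as the error lines quoted later in this thread do; `group_by_file` is a hypothetical helper, not part of validate-parlamint.pl.

```python
import re
from collections import defaultdict

# Assumption: messages look like "<path>:<line>:<column>: <severity>: <text>".
MESSAGE_RE = re.compile(r"^(?P<path>.+?):(?P<line>\d+):(?P<col>\d+): ")


def group_by_file(lines):
    """Group messages by file path (first-seen order), sorted by position in the file.

    Lines that do not match the expected pattern are ignored.
    """
    grouped = defaultdict(list)
    for raw in lines:
        text = raw.rstrip("\n")
        match = MESSAGE_RE.match(text)
        if match:
            position = (int(match["line"]), int(match["col"]))
            grouped[match["path"]].append((position, text))
    return {
        path: [text for _, text in sorted(messages)]
        for path, messages in grouped.items()
    }
```

Printing the groups one after another would give file-by-file output again, at the cost of buffering all messages until jing finishes.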

@TomazErjavec
Collaborator

I actually don't think jing is the bottleneck; rather, it is the XSLT validation that is slow. Also, validate-parlamint.pl takes files one by one, so it would be difficult to just run jing in parallel. In short, I don't think it's worth trying to give jing multiple files.

@matyaskopp
Collaborator Author

I actually don't think jing is the bottleneck; rather, it is the XSLT validation that is slow. Also, validate-parlamint.pl takes files one by one, so it would be difficult to just run jing in parallel. In short, I don't think it's worth trying to give jing multiple files.

I have tried it, and validate-parlamint is about 25% faster (tested on LV) with Jing passing multiple files to jing.

@TomazErjavec
Collaborator

about 25% faster (tested on LV)

ok, but I still think it is not worth it given the other problems with this approach. This might save 10% processing time, if that.

with Jing passing multiple files to jing.

Huh?

@matyaskopp
Collaborator Author

Ok, I have staged my changes.

Another place where we could speed things up is the link-checker: transform teiCorpus/teiHeader into a smaller temporary XML file that contains just a list of elements with IDs. Parsing this file could be faster, but the impact would be small too...
So no speedup for now; I am moving this to the Future milestone.
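
A minimal sketch of the link-checker idea, for illustration only. It assumes the header has already been fully XIncluded into one string; the function name `build_id_index` and the output element names `ids` and `id` are made up here and do not come from the ParlaMint scripts.

```python
import xml.etree.ElementTree as ET

# Fully qualified name under which ElementTree exposes the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def build_id_index(header_xml: str) -> str:
    """Return a small XML document listing every xml:id found in the input."""
    root = ET.fromstring(header_xml)
    index = ET.Element("ids")  # hypothetical output format
    for element in root.iter():
        identifier = element.get(XML_ID)
        if identifier is not None:
            ET.SubElement(index, "id", {"value": identifier})
    return ET.tostring(index, encoding="unicode")
```

A link checker could then load only this index instead of the whole header, although the gain is expected to be small.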

@matyaskopp matyaskopp added this to the Future milestone Mar 1, 2024
@TomazErjavec
Collaborator

Another place where we could speed things up is the link-checker: transform teiCorpus/teiHeader into a smaller temporary XML file that contains just a list of elements with IDs. Parsing this file could be faster, but the impact would be small too...

Yes, I think very small: the complete teiHeader (with everything XIncluded) fits into the memory of any computer powerful enough to process the corpus.

@matyaskopp matyaskopp linked a pull request Jan 20, 2025 that will close this issue
matyaskopp added a commit that referenced this issue Jan 20, 2025
@matyaskopp
Collaborator Author

Parallel currently reports this warning on 60 threads:

INFO: Char validation for ParlaMint-IL_2004-01-01-16ptv487015.ana.xml
INFO: XML validation for ParlaMint-IL_2004-01-01-16ptv487015.ana.xml
INFO: XML validation for ParlaMint-IL_2004-01-01-16ptv487015.ana.xml
INFO: Content validaton for ParlaMint-IL_2004-01-01-16ptv487015.ana.xml
INFO: Link checking for ParlaMint-IL_2004-01-01-16ptv487015.ana.xml
parallel: Warning: No more file handles. 
parallel: Warning: Try running 'parallel -j0 -N 100 --pipe parallel -j0'
parallel: Warning: or increasing 'ulimit -n' (try: ulimit -n `ulimit -Hn`)
parallel: Warning: or increasing 'nofile' in /etc/security/limits.conf
parallel: Warning: or increasing /proc/sys/fs/file-max
INFO: Validating component TEI.ana file /lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/2004/ParlaMint-IL_2004-01-04-16ptm533605.ana.xml
INFO: Char validation for ParlaMint-IL_2004-01-04-16ptm533605.ana.xml
INFO: XML validation for ParlaMint-IL_2004-01-04-16ptm533605.ana.xml
/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/2004/ParlaMint-IL_2004-01-04-16ptm533605.ana.xml:122023:118: error: value of attribute "lemma" is invalid; must be a string matching the regular expression "(\S)|(\S[\S ]*\S)"
/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/2004/ParlaMint-IL_2004-01-04-16ptm533605.ana.xml:1185883:106: error: value of attribute "lemma" is invalid; must be a string matching the regular expression "(\S)|(\S[\S ]*\S)"
/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/2004/ParlaMint-IL_2004-01-04-16ptm533605.ana.xml:1364748:107: error: value of attribute "lemma" is invalid; must be a string matching the regular expression "(\S)|(\S[\S ]*\S)"
/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/2004/ParlaMint-IL_2004-01-04-16ptm533605.ana.xml:1433903:104: error: value of attribute "lemma" is invalid; must be a string matching the regular expression "(\S)|(\S[\S ]*\S)"
INFO: XML validation for ParlaMint-IL_2004-01-04-16ptm533605.ana.xml
INFO: Content validaton for ParlaMint-IL_2004-01-04-16ptm533605.ana.xml
INFO: Link checking for ParlaMint-IL_2004-01-04-16ptm533605.ana.xml
INFO: Validating component TEI.ana file /lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/2004/ParlaMint-IL_2004-01-04-16ptv547034.ana.xml

It seems that this warning does not affect the validation process.
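
If the warning ever needs to be silenced, one of the hints it prints is to raise `ulimit -n` to the hard limit. The sketch below does the same from Python with the standard `resource` module (Unix only). It changes the limit only for the current process and the children it starts, so it would help only if parallel is launched from that process; `raise_open_file_limit` is a hypothetical helper name.

```python
import resource


def raise_open_file_limit():
    """Raise the soft open-file limit to the hard limit.

    Returns (old_soft, new_soft). If the system refuses the change,
    the limit is left as it was.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # some systems reject the hard value; keep the old limit
    new_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, new_soft


if __name__ == "__main__":
    old, new = raise_open_file_limit()
    print(f"open-file soft limit: {old} -> {new}")
```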
