-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Indexing of UMLS subset #1
Comments
I suppose the problem are umlauts and other characters that need more than one byte to represent. When replacing every non-ASCII character while generating the indices (e.g. in MultiKeyIndex::loadTable) with an ASCII character (e.g. 'x'), the lookup works (obviously not for terms containing the replaced characters, but only for ones containing ascii chars). |
I've created the branch What's missing?
Future, merging initial UTF-8 support with master and expanding support to allow use of both UTF-8 and UTF-16. |
Hello! I have just run into this exact problem, and before filing my own issue thought to check the existing ones. I see that the For what it's worth, the current state of UTF-8 support on the main branch appears to be that the indexing code does work in that a correctly-formed index is produced and written to disk- the necessary places in the code are all charset-aware. The issue is with lookup, specifically |
Huh - on closer inspection, it looks like the line in question (which is 435, not 434, apologies) was actually added as part of commit da51d16, which was about adding UTF-8 support to the indexing part of MML. The commented-out line on 432, which (to my eye and according to my tests) does the correct Unicode-aware string comparison, was commented out as part of this revision in favor of the byte-level comparison whose misbehavior is at the root of this issue. So this makes me think that I must be misunderstanding the situation. @willjrogers , any thoughts? |
I believe the document loaders in the package gov.nih.nlm.nls.metamap.document handle loading documents in UTF-8. Unfortunately, the result formatters (in gov.nih.nlm.nls.metamap.document) have not been updated to support UTF-8 except when called by the method MetaMapLite:listEntities which does open the PrintWriter using the UTF-8 charset. When looking up terms, the class irutils.MappedMultiKeyIndex does encode Strings to Bytes using “UTF-8” encoding by default.
Best,
Willie Rogers
--
Willie Rogers,
Contractor, Lister Hill Center, National Library of Medicine, NIH
email: ***@***.******@***.***>
Rogers, Willie (NIH/NLM/LHC) [C]
From: Steven Bedrick ***@***.***>
Date: Monday, May 1, 2023 at 8:10 PM
To: lhncbc/metamaplite ***@***.***>
Cc: Rogers, Willie (NIH/NLM/LHC) [C] ***@***.***>, Mention ***@***.***>
Subject: [EXTERNAL] Re: [lhncbc/metamaplite] Indexing of UMLS subset (#1)
Huh - on closer inspection, it looks like the line in question (which is 435, not 434, apologies) was actually added as part of commit da51d16<da51d16>, which was about adding UTF-8 support to the indexing part of MML. The commented-out line on 432, which (to my eye and according to my tests) does the correct Unicode-aware string comparison, was commented out as part of this revision in favor of the byte-level comparison whose misbehavior is at the root of this issue.
So this makes me think that I must be misunderstanding the situation. @willjrogers<https://github.com/willjrogers> , any thoughts?
—
Reply to this email directly, view it on GitHub<#1 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AHOVZBMWNDJ3FFMGE7WKUR3XEBGGZANCNFSM4FQT26VQ>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and are confident the content is safe.
|
So irutils.MappedMultiKeyIndex does use the UTF-8 byte representation for the lookup key, which works for lookup keys that involve non-ASCII characters, but this leads to the binary search implementation breaking for certain other lookup keys in the same index shard which is causing the problem that I have run into. It took me quite a while to figure this out, but the short version of the story is that the binary search will fail for keys that both a) share a common UTF-8 byte length as another word in MRCONSO.RRF that involve non-ASCII characters, and b) share a common prefix with that word (that common prefix being the run of text up to the first non-ASCII character in the word). As you may imagine, this made it rather a tricky bug to track down, as the lookups that were failing did not include any non-ASCII Unicode text, and test strings that did use non-ASCII Unicode characters were working just fine, so that threw me off for while. But at this point I have a minimal MRCONSO.RRF file that will demonstrate this bug, and can confirm that doing the comparison in |
Here might be another piece of the puzzle - I'm revisiting the code that controls how the various postings/term dictionary/etc. files are generated, and if I am understanding things correctly, everything depends on the fact that TreeMap allows for sorted iteration of keys. However, again, IIUC, that iteration happens according to the natural sort order as defined by the implementation of Comparable on the keys themselves - in this case, String's |
…doing string comparison (rather than byte-level comparison). This is necessary because the binary search table's layout is being done according to Java's (Unicode-aware) string comparison order, not encoded byte comparison order, so the lookup needs to match. This fixes issue LHNCBC#1; see issue for more complete discussion.
Hi,
I am following the Generating Tables section in the Design.md to create the index files for a custome subset of the UMLS. The CreateIndexes program runs without any errors (just the warnings about missing meshtcrelaxed.txt and vars.txt).
However, the irutils.MappedMultiKeyIndexLookup lookup command does not return any results when applied on the generated indexes. When applying the same command on the indices distributed with MetaMapLite, results are returned. When browsing the subset for which the indices are generated with Metamorphosys, the searched term is also found.
If there is any debug information I can provide, I'm more than happy to provide it. Thanks for your help!
The text was updated successfully, but these errors were encountered: