Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing of UMLS subset #1

Open
afsluit opened this issue Aug 21, 2018 · 7 comments
Open

Indexing of UMLS subset #1

afsluit opened this issue Aug 21, 2018 · 7 comments
Assignees

Comments

@afsluit
Copy link

afsluit commented Aug 21, 2018

Hi,
I am following the Generating Tables section in the Design.md to create the index files for a custome subset of the UMLS. The CreateIndexes program runs without any errors (just the warnings about missing meshtcrelaxed.txt and vars.txt).

However, the irutils.MappedMultiKeyIndexLookup lookup command does not return any results when applied on the generated indexes. When applying the same command on the indices distributed with MetaMapLite, results are returned. When browsing the subset for which the indices are generated with Metamorphosys, the searched term is also found.

If there is any debug information I can provide, I'm more than happy to provide it. Thanks for your help!

@afsluit
Copy link
Author

afsluit commented Aug 31, 2018

I suppose the problem are umlauts and other characters that need more than one byte to represent. When replacing every non-ASCII character while generating the indices (e.g. in MultiKeyIndex::loadTable) with an ASCII character (e.g. 'x'), the lookup works (obviously not for terms containing the replaced characters, but only for ones containing ascii chars).

@willjrogers
Copy link
Collaborator

I've created the branch utf (https://github.com/lhncbc/metamaplite/tree/utf) that supports producing and querying indexes using UTF-8. Almost all of the classes in irutils had to be updated to support UTF-8. The UTF-8 implementation supports the old ASCII indexes. Also added unit tests to check if the irutils library is producing indexes correctly.

What's missing?

  1. Currently no support for generating lexical variants for terms containing UTF-8 for the chunker.
  2. Term normalization needs to updated to support UTF-8 diacritics to ASCII (ö -> o) and possibly the reverse and conversion of some Greek characters in chemicals (α -> alpha).
  3. Some input readers and result formatters need to updated to support reading and writing UTF-8.
  4. A mechanism for querying the index on its encoding needs to be added.

Future, merging initial UTF-8 support with master and expanding support to allow use of both UTF-8 and UTF-16.

@willjrogers willjrogers self-assigned this Feb 15, 2019
willjrogers added a commit that referenced this issue Feb 25, 2021
Fixed error where File resource was not closed when loading properties
from file.

Also updated incorrect examples.

This update addresses part #1 of issue #13
 (#13)
@stevenbedrick
Copy link
Contributor

stevenbedrick commented Apr 29, 2023

Hello! I have just run into this exact problem, and before filing my own issue thought to check the existing ones. I see that the utf branch is pretty far behind the main branch, has there been any more thought to making "mainstream" MetaMapLite support UTF-8?

For what it's worth, the current state of UTF-8 support on the main branch appears to be that the indexing code does work in that a correctly-formed index is produced and written to disk- the necessary places in the code are all charset-aware. The issue is with lookup, specifically MappedMultiKeyIndex.dictionaryBinarySearch(), line 434, which does not do Unicode-aware String comparison. Oddly, there is a commented-out line that would do a correct comparison, but the actual code that is being executed to do the comparison is looking at raw bytes and so is failing on UTF-8 that involves code points higher than U+007F.

@stevenbedrick
Copy link
Contributor

stevenbedrick commented May 2, 2023

Huh - on closer inspection, it looks like the line in question (which is 435, not 434, apologies) was actually added as part of commit da51d16, which was about adding UTF-8 support to the indexing part of MML. The commented-out line on 432, which (to my eye and according to my tests) does the correct Unicode-aware string comparison, was commented out as part of this revision in favor of the byte-level comparison whose misbehavior is at the root of this issue.

So this makes me think that I must be misunderstanding the situation. @willjrogers , any thoughts?

@willjrogers
Copy link
Collaborator

willjrogers commented May 2, 2023 via email

@stevenbedrick
Copy link
Contributor

stevenbedrick commented May 2, 2023

So irutils.MappedMultiKeyIndex does use the UTF-8 byte representation for the lookup key, which works for lookup keys that involve non-ASCII characters, but this leads to the binary search implementation breaking for certain other lookup keys in the same index shard which is causing the problem that I have run into.

It took me quite a while to figure this out, but the short version of the story is that the binary search will fail for keys that both a) share a common UTF-8 byte length as another word in MRCONSO.RRF that involve non-ASCII characters, and b) share a common prefix with that word (that common prefix being the run of text up to the first non-ASCII character in the word).

As you may imagine, this made it rather a tricky bug to track down, as the lookups that were failing did not include any non-ASCII Unicode text, and test strings that did use non-ASCII Unicode characters were working just fine, so that threw me off for while. But at this point I have a minimal MRCONSO.RRF file that will demonstrate this bug, and can confirm that doing the comparison in dictionaryBinarySearch() at the String level rather than the UTF-8 byte level solves the problem. Is there any reason not to do the comparison at the String level?

@stevenbedrick
Copy link
Contributor

Here might be another piece of the puzzle - I'm revisiting the code that controls how the various postings/term dictionary/etc. files are generated, and if I am understanding things correctly, everything depends on the fact that TreeMap allows for sorted iteration of keys. However, again, IIUC, that iteration happens according to the natural sort order as defined by the implementation of Comparable on the keys themselves - in this case, String's compareTo method. This would also explain why doing the comparison for binary search at the byte level (rather than at the string level) is failing - because the files are being created in string-sorted order not UTF-8-byte-sorted order.

stevenbedrick added a commit to stevenbedrick/metamaplite that referenced this issue May 6, 2023
…doing string comparison (rather than byte-level comparison). This is necessary because the binary search table's layout is being done according to Java's (Unicode-aware) string comparison order, not encoded byte comparison order, so the lookup needs to match. This fixes issue LHNCBC#1; see issue for more complete discussion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants