Solve nested entities problems by using SpanCategorizer #88
This PR uses `doc.spans["sc"]` (the default span key of the SpanCategorizer) to solve the problem of overlapping tokens in nested NER for spaCy. By replacing `doc.ents` with `doc.spans["sc"]`, all candidate entities can be stored without errors, including those that overlap.

After storing all candidate spans, we filter out the overlapping ones before adding them to `doc.ents`. The overlapping spans are removed with `spacy.util.filter_spans`. When spans overlap, the rule is to prefer the (first) longest span over shorter ones.
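The flow described above can be illustrated with a minimal sketch. It is not the code from this PR: the example sentence, the labels, and the hand-built spans are assumptions chosen for illustration, and a blank English pipeline stands in for the trained model.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

# Blank pipeline: tokenizer only, no trained components (assumption for the demo).
nlp = spacy.blank("en")
doc = nlp("The University of California is in Berkeley")

# Hypothetical nested candidates: "California" lies inside
# "University of California". Indices are token offsets.
candidates = [
    Span(doc, 1, 4, label="ORG"),  # University of California
    Span(doc, 3, 4, label="GPE"),  # California (nested in the ORG span)
    Span(doc, 6, 7, label="GPE"),  # Berkeley
]

# doc.spans allows overlapping spans, so every candidate can be stored.
doc.spans["sc"] = candidates

# filter_spans keeps the (first) longest span wherever spans overlap,
# which gives a non-overlapping list that doc.ents accepts.
doc.ents = filter_spans(doc.spans["sc"])

print([(ent.text, ent.label_) for ent in doc.ents])
```

In this sketch the nested "California" span remains available in `doc.spans["sc"]`, while `doc.ents` keeps only the longer "University of California" span and "Berkeley".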