Commit
Fix: translate upenn treebank pos to wordnet pos
Problem: The function that translates a upenn treebank part-of-speech tag into a wordnet part-of-speech tag, `get_wordnet_pos`, universally returned `wn.NOUN`.

Fix: `nltk.tag`'s `pos_tag` returns a upenn treebank pos; the available values are listed in `nltk.help.upenn_tagset()`.

Nouns:
- NN Noun, singular or mass
- NNS Noun, plural
- NNP Proper noun, singular
- NNPS Proper noun, plural

Verbs:
- VB Verb, base form
- VBD Verb, past tense
- VBG Verb, gerund or present participle
- VBN Verb, past participle
- VBP Verb, non-3rd person singular present
- VBZ Verb, 3rd person singular present

Adjectives:
- JJ Adjective
- JJR Adjective, comparative
- JJS Adjective, superlative

Adverb:
- RB Adverb
- RBR Adverb, comparative
- RBS Adverb, superlative
- RP Particle

The first letter of each tag (N, V, J, R) determines the wordnet lemmatizer pos, so the fix is to look only at the first letter.

This makes a small but noticeable difference.

Before:

    $ macroetym --showfamilies Latinate,Germanic moby-dick.txt
              moby-dick.txt
    Latinate  63.449849
    Germanic  34.358548

After:

    $ macroetym --showfamilies Latinate,Germanic moby-dick.txt
              moby-dick.txt
    Latinate  62.542085
    Germanic  35.540172

This is because conjugated verbs, e.g., `writhed`, were previously treated as nouns but are now recognized as their base verb forms, e.g., `writhe`, and can therefore be tagged, e.g., `writhe [wriþan (ang)]`.
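The first-letter mapping described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code from this commit: the lookup table name is hypothetical, the fallback to noun for unrecognized tags is an assumption, and the wordnet constants are written as string literals (the values of `wn.NOUN`, `wn.VERB`, `wn.ADJ`, and `wn.ADV`) so the sketch runs without the NLTK corpus data installed.

```python
# WordNet POS constants, written as literals so no NLTK data is required.
# In nltk.corpus.wordnet these are wn.NOUN, wn.VERB, wn.ADJ and wn.ADV.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

# Hypothetical lookup table: first letter of a upenn treebank tag -> wordnet pos.
_FIRST_LETTER_TO_WORDNET = {"N": NOUN, "V": VERB, "J": ADJ, "R": ADV}


def get_wordnet_pos(treebank_tag):
    """Map a upenn treebank tag to a wordnet pos using only its first letter."""
    # Assumption: tags outside the four groups (e.g. DT, IN) and empty tags
    # fall back to NOUN, which is also the lemmatizer's default pos.
    return _FIRST_LETTER_TO_WORDNET.get(treebank_tag[:1], NOUN)


if __name__ == "__main__":
    for tag in ("NNS", "VBD", "JJR", "RBS", "DT"):
        print(tag, get_wordnet_pos(tag))
```

With a mapping like this, a word tagged `VBD` such as `writhed` is passed to the lemmatizer as a verb rather than as a noun.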