Switch to spaCy as the default parser #4

lukehsiao · 2018-02-07T06:45:04Z

Support using spaCy as the lingual parser for the old parser (i.e. the one that does not support pdftotree output).

TODO:

Upgrade to spaCy 2.x (#9)
Compare features pre and post spaCy
Check the visual linker for mismatches. Update: see "Word mismatch between HTML and PDF for visual linker" (#12). However, it looks like we don't have Unicode issues.

lukehsiao · 2018-02-07T21:16:12Z

Output from CoreNLP on the simple documents:

---------------------------------------- Captured log call -----------------------------------------
[DEBUG] Starting new HTTP connection (1): 127.0.0.1
[DEBUG] http://127.0.0.1:12345 "POST /?properties=%7B%22annotators%22:%20%22tokenize,ssplit,pos,lemma,depparse,ner%22,%22outputFormat%22:%20%22json%22,%22tokenize.options%22:%22escapeForwardSlashAsterisk=false,asciiQuotes=false,unicodeQuotes=false,normalizeOtherBrackets=false,ptb3Ellipsis=false,normalizeParentheses=false,normalizeCurrency=false,unicodeEllipsis=false,latexQuotes=false,normalizeSpace=false,strictTreebank3=true,ptb3Dashes=false,normalizeFractions=false%22,%22ssplit.htmlBoundariesToDiscard%22:%20%22NB%22%7D HTTP/1.1" 200 31713
[DEBUG] http://127.0.0.1:12345 "POST /?properties=%7B%22annotators%22:%20%22tokenize,ssplit,pos,lemma,depparse,ner%22,%22outputFormat%22:%20%22json%22,%22tokenize.options%22:%22escapeForwardSlashAsterisk=false,asciiQuotes=false,unicodeQuotes=false,normalizeOtherBrackets=false,ptb3Ellipsis=false,normalizeParentheses=false,normalizeCurrency=false,unicodeEllipsis=false,latexQuotes=false,normalizeSpace=false,strictTreebank3=true,ptb3Dashes=false,normalizeFractions=false%22,%22ssplit.htmlBoundariesToDiscard%22:%20%22NB%22%7D HTTP/1.1" 200 62272
[DEBUG] Doc: diseases
[DEBUG]   Phrase: Types of viruses, coughs, and colds
[DEBUG]   Phrase: Here isa line break
[DEBUG]   Phrase: I don't have Brain Canceror the hiccups
[DEBUG]   Phrase: See Table 1 Below.
[DEBUG]   Phrase: Common Ailments
[DEBUG]   Phrase: In between the tables there is a nasty case of heart attack
[DEBUG]   Phrase: And here is a final sentence with warts.
[DEBUG]   Phrase: Table 1: Infectious diseases and where to find them.
[DEBUG]   Phrase: Table 2: Three ways to get Pneumonia and how much they cost.
[DEBUG]   Phrase: Disease
[DEBUG]   Phrase: Location
[DEBUG]   Phrase: Year
[DEBUG]   Phrase: Polio and BC546 is -55OC cold.
[DEBUG]   Phrase: -Dublin to Milwaukee
[DEBUG]   Phrase: 2001
[DEBUG]   Phrase: I don't like TIPL761 or Chicken Pox or pizza.
[DEBUG]   Phrase: Shingles is also bad.
[DEBUG]   Phrase: whooping cough
[DEBUG]   Phrase: 2009
[DEBUG]   Phrase: Scurvy
[DEBUG]   Phrase: Annapolis
[DEBUG]   Phrase: Junction and Storage Temperature -55 to 150 o ?
[DEBUG]   Phrase: C
[DEBUG]   Phrase: Problem
[DEBUG]   Phrase: Cause
[DEBUG]   Phrase: Cost
[DEBUG]   Phrase: Arthritis
[DEBUG]   Phrase: Pokemon Go
[DEBUG]   Phrase: Free
[DEBUG]   Phrase: Yellow
[DEBUG]   Phrase: Fever
[DEBUG]   Phrase: Unicorns
[DEBUG]   Phrase: $17.75
[DEBUG]   Phrase: Hypochondria
[DEBUG]   Phrase: Fear
[DEBUG]   Phrase: $100
[DEBUG] Doc: md
[DEBUG]   Phrase: Sample Markdown
[DEBUG]   Phrase: This is some basic, sample markdown.
[DEBUG]   Phrase: Second Heading
[DEBUG]   Phrase: Unordered lists, and:
[DEBUG]   Phrase: One
[DEBUG]   Phrase: Two
[DEBUG]   Phrase: Three
[DEBUG]   Phrase: More
[DEBUG]   Phrase: Blockquote
[DEBUG]   Phrase: And
[DEBUG]   Phrase: bold
[DEBUG]   Phrase: ,
[DEBUG]   Phrase: italics
[DEBUG]   Phrase: , and even
[DEBUG]   Phrase: italics and later
[DEBUG]   Phrase: .
[DEBUG]   Phrase: Even
[DEBUG]   Phrase: bold
[DEBUG]   Phrase: strikethrough
[DEBUG]   Phrase: .
[DEBUG]   Phrase: A link
[DEBUG]   Phrase: to somewhere.
[DEBUG]   Phrase: Here is a table
[DEBUG]   Phrase: Or inline code like
[DEBUG]   Phrase: var foo = 'bar';
[DEBUG]   Phrase: .
[DEBUG]   Phrase: Or an image of bears
[DEBUG]   Phrase: The end ...
[DEBUG]   Phrase: Name
[DEBUG]   Phrase: Lunch order
[DEBUG]   Phrase: Spicy
[DEBUG]   Phrase: Owes
[DEBUG]   Phrase: Joan
[DEBUG]   Phrase: saag paneer
[DEBUG]   Phrase: medium
[DEBUG]   Phrase: $11
[DEBUG]   Phrase: Sally
[DEBUG]   Phrase: vindaloo
[DEBUG]   Phrase: mild
[DEBUG]   Phrase: $14
[DEBUG]   Phrase: Erin
[DEBUG]   Phrase: lamb madras
[DEBUG]   Phrase: HOT
[DEBUG]   Phrase: $5

CoreNLP is splitting differently formatted text (e.g. italics, bold, etc.) into separate phrases.
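The log does not show where the split is introduced. As a minimal sketch of how this kind of fragmentation can arise, assuming each HTML text node is handed to the parser on its own, the snippet below uses only the standard library `html.parser` (not Fonduer's parser or CoreNLP). The `fragment` string and the `TextNodeCollector` and `text_nodes` names are hypothetical and only loosely modeled on the "md" document above.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect every text node of an HTML fragment in document order."""

    def __init__(self):
        super().__init__()
        self.raw_nodes = []

    def handle_data(self, data):
        # Called once per run of text between two tags.
        self.raw_nodes.append(data)


def text_nodes(html):
    """Return the raw text nodes of an HTML fragment."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.raw_nodes


# Hypothetical fragment, loosely modeled on the "md" test document.
fragment = "<p>And <b>bold</b>, <i>italics</i>, and even <s>strikethrough</s>.</p>"

raw = text_nodes(fragment)

# Treating each text node as its own unit yields one fragment per formatting run.
per_node = [t.strip() for t in raw if t.strip()]

# Joining the nodes before sentence splitting keeps the sentence in one piece.
merged = "".join(raw)

print(per_node)
print(merged)
```

If the per-node view is what reaches the lingual parser, punctuation such as "," and "." ends up as standalone phrases, which matches the shape of the log above. Whether that is the actual cause here is an assumption.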

lukehsiao · 2018-02-10T23:28:46Z

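Inspecting 5 candidates using the following code, run at the end of the stg_temp_max tutorial:

from fonduer.features import features

# Collect the candidates whose first span starts with 'BC856' and whose
# second span is exactly '150'. train_cands comes from the tutorial.
cand = []
for i, c in enumerate(train_cands):
    if c[0].get_span().startswith('BC856') and c[1].get_span() == '150':
        print("###", i)
        cand.append(c)

print("Candidates: {}".format(len(cand)))

# Write every feature of each selected candidate to a log file. The
# context manager closes the file even if feature extraction raises.
with open('scapy_log_features.txt', 'w') as log:
    for c in cand:
        log.write("Candidate: {}\n".format(c))
        for f in features.get_all_feats([c]):
            log.write("    Feature: {}\n".format(f))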
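For the "compare features pre and post spaCy" item, two such logs can be diffed per candidate. The following is a rough sketch, not code from the repository: it assumes one log was written with each parser, that both use the `Candidate:` / `Feature:` line format shown above, and that a candidate prints identically in both runs. The function names and the sample log contents are placeholders.

```python
def parse_feature_log(lines):
    """Group 'Feature:' lines under the preceding 'Candidate:' line."""
    feats = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("Candidate: "):
            current = line[len("Candidate: "):]
            feats.setdefault(current, set())
        elif line.startswith("    Feature: ") and current is not None:
            feats[current].add(line[len("    Feature: "):])
    return feats


def diff_features(before, after):
    """Map each candidate whose features differ to (only_before, only_after)."""
    changes = {}
    for cand in sorted(set(before) | set(after)):
        old = before.get(cand, set())
        new = after.get(cand, set())
        if old != new:
            changes[cand] = (old - new, new - old)
    return changes


# Placeholder log contents; real logs would be read with open(...).
corenlp_log = [
    "Candidate: cand_a\n",
    "    Feature: feat_1\n",
    "    Feature: feat_2\n",
]
spacy_log = [
    "Candidate: cand_a\n",
    "    Feature: feat_1\n",
    "    Feature: feat_3\n",
]

changes = diff_features(parse_feature_log(corenlp_log),
                        parse_feature_log(spacy_log))
for cand, (removed, added) in changes.items():
    print(cand, "removed:", sorted(removed), "added:", sorted(added))
```

Using sets ignores feature order and duplicates, which seems reasonable for spotting features that appear or disappear after the parser switch; a comparison that also needs counts would have to use something like `collections.Counter` instead.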

Closes #4.

* Update wording of content
* Remove paleo tutorial

The paleo dataset is too noisy to perform stably without much more data than would be reasonable to run in this tutorial. Moving it to a separate branch.

lukehsiao self-assigned this Feb 7, 2018

lukehsiao added the enhancement New feature or request label Feb 7, 2018

lukehsiao changed the title ~~Switch to spacy as the default parser~~ Switch to spaCy as the default parser Feb 7, 2018

lukehsiao mentioned this issue Feb 8, 2018

Switch to spaCy as the lingual parser #7

Merged

2 tasks

lukehsiao closed this as completed in #7 Feb 11, 2018

lukehsiao added a commit that referenced this issue Feb 11, 2018

Switch to spaCy as the lingual parser

e695283

Closes #4.
