Switch to spaCy as the default parser #4

Closed
3 tasks done
lukehsiao opened this issue Feb 7, 2018 · 2 comments · Fixed by #7

lukehsiao commented Feb 7, 2018

Support using spaCy as the lingual parser for the old parser (i.e. the one that does not support pdftotree output).

TODO:

lukehsiao self-assigned this Feb 7, 2018
lukehsiao added the enhancement (New feature or request) label Feb 7, 2018
lukehsiao changed the title from "Switch to spacy as the default parser" to "Switch to spaCy as the default parser" Feb 7, 2018

lukehsiao commented Feb 7, 2018

Output from CoreNLP on the simple documents:

---------------------------------------- Captured log call -----------------------------------------
[DEBUG] Starting new HTTP connection (1): 127.0.0.1
[DEBUG] http://127.0.0.1:12345 "POST /?properties=%7B%22annotators%22:%20%22tokenize,ssplit,pos,lemma,depparse,ner%22,%22outputFormat%22:%20%22json%22,%22tokenize.options%22:%22escapeForwardSlashAsterisk=false,asciiQuotes=false,unicodeQuotes=false,normalizeOtherBrackets=false,ptb3Ellipsis=false,normalizeParentheses=false,normalizeCurrency=false,unicodeEllipsis=false,latexQuotes=false,normalizeSpace=false,strictTreebank3=true,ptb3Dashes=false,normalizeFractions=false%22,%22ssplit.htmlBoundariesToDiscard%22:%20%22NB%22%7D HTTP/1.1" 200 31713
[DEBUG] http://127.0.0.1:12345 "POST /?properties=%7B%22annotators%22:%20%22tokenize,ssplit,pos,lemma,depparse,ner%22,%22outputFormat%22:%20%22json%22,%22tokenize.options%22:%22escapeForwardSlashAsterisk=false,asciiQuotes=false,unicodeQuotes=false,normalizeOtherBrackets=false,ptb3Ellipsis=false,normalizeParentheses=false,normalizeCurrency=false,unicodeEllipsis=false,latexQuotes=false,normalizeSpace=false,strictTreebank3=true,ptb3Dashes=false,normalizeFractions=false%22,%22ssplit.htmlBoundariesToDiscard%22:%20%22NB%22%7D HTTP/1.1" 200 62272
[DEBUG] Doc: diseases
[DEBUG]   Phrase: Types of viruses, coughs, and colds
[DEBUG]   Phrase: Here isa line break
[DEBUG]   Phrase: I don't have Brain Canceror the hiccups
[DEBUG]   Phrase: See Table 1 Below.
[DEBUG]   Phrase: Common Ailments
[DEBUG]   Phrase: In between the tables there is a nasty case of heart attack
[DEBUG]   Phrase: And here is a final sentence with warts.
[DEBUG]   Phrase: Table 1: Infectious diseases and where to find them.
[DEBUG]   Phrase: Table 2: Three ways to get Pneumonia and how much they cost.
[DEBUG]   Phrase: Disease
[DEBUG]   Phrase: Location
[DEBUG]   Phrase: Year
[DEBUG]   Phrase: Polio and BC546 is -55OC cold.
[DEBUG]   Phrase: -Dublin to Milwaukee
[DEBUG]   Phrase: 2001
[DEBUG]   Phrase: I don't like TIPL761 or Chicken Pox or pizza.
[DEBUG]   Phrase: Shingles is also bad.
[DEBUG]   Phrase: whooping cough
[DEBUG]   Phrase: 2009
[DEBUG]   Phrase: Scurvy
[DEBUG]   Phrase: Annapolis
[DEBUG]   Phrase: Junction and Storage Temperature -55 to 150 o ?
[DEBUG]   Phrase: C
[DEBUG]   Phrase: Problem
[DEBUG]   Phrase: Cause
[DEBUG]   Phrase: Cost
[DEBUG]   Phrase: Arthritis
[DEBUG]   Phrase: Pokemon Go
[DEBUG]   Phrase: Free
[DEBUG]   Phrase: Yellow
[DEBUG]   Phrase: Fever
[DEBUG]   Phrase: Unicorns
[DEBUG]   Phrase: $17.75
[DEBUG]   Phrase: Hypochondria
[DEBUG]   Phrase: Fear
[DEBUG]   Phrase: $100
[DEBUG] Doc: md
[DEBUG]   Phrase: Sample Markdown
[DEBUG]   Phrase: This is some basic, sample markdown.
[DEBUG]   Phrase: Second Heading
[DEBUG]   Phrase: Unordered lists, and:
[DEBUG]   Phrase: One
[DEBUG]   Phrase: Two
[DEBUG]   Phrase: Three
[DEBUG]   Phrase: More
[DEBUG]   Phrase: Blockquote
[DEBUG]   Phrase: And
[DEBUG]   Phrase: bold
[DEBUG]   Phrase: ,
[DEBUG]   Phrase: italics
[DEBUG]   Phrase: , and even
[DEBUG]   Phrase: italics and later
[DEBUG]   Phrase: .
[DEBUG]   Phrase: Even
[DEBUG]   Phrase: bold
[DEBUG]   Phrase: strikethrough
[DEBUG]   Phrase: .
[DEBUG]   Phrase: A link
[DEBUG]   Phrase: to somewhere.
[DEBUG]   Phrase: Here is a table
[DEBUG]   Phrase: Or inline code like
[DEBUG]   Phrase: var foo = 'bar';
[DEBUG]   Phrase: .
[DEBUG]   Phrase: Or an image of bears
[DEBUG]   Phrase: The end ...
[DEBUG]   Phrase: Name
[DEBUG]   Phrase: Lunch order
[DEBUG]   Phrase: Spicy
[DEBUG]   Phrase: Owes
[DEBUG]   Phrase: Joan
[DEBUG]   Phrase: saag paneer
[DEBUG]   Phrase: medium
[DEBUG]   Phrase: $11
[DEBUG]   Phrase: Sally
[DEBUG]   Phrase: vindaloo
[DEBUG]   Phrase: mild
[DEBUG]   Phrase: $14
[DEBUG]   Phrase: Erin
[DEBUG]   Phrase: lamb madras
[DEBUG]   Phrase: HOT
[DEBUG]   Phrase: $5

CoreNLP is splitting differently formatted text (e.g., italics, bold, etc.) into separate phrases, so a single sentence ends up as several fragments, some of them only punctuation.
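
To make the fragmentation easier to see, here is a minimal sketch that groups the `Phrase:` lines of a debug log like the one above by document and lists the phrases that contain no letters or digits. The helper names (`group_phrases`, `punctuation_only`) are hypothetical and not part of Fonduer; the sample lines are copied from the `md` document in the log.

```python
import re

# Matches "[DEBUG] Doc: <name>" and "[DEBUG]   Phrase: <text>" lines only;
# other debug lines (e.g. the HTTP requests) are ignored.
LINE_RE = re.compile(r"^\[DEBUG\]\s+(Doc|Phrase): (.*)$")


def group_phrases(log_lines):
    """Return {doc_name: [phrase, ...]} in the order the log lists them."""
    docs = {}
    current = None
    for line in log_lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue
        kind, text = match.groups()
        if kind == "Doc":
            current = docs.setdefault(text, [])
        elif current is not None:
            current.append(text)
    return docs


def punctuation_only(phrases):
    """Phrases with no alphanumeric character, e.g. ',' or '.'."""
    return [p for p in phrases if p.strip() and not any(ch.isalnum() for ch in p)]


# Sample lines copied from the log above.
sample = [
    "[DEBUG] Doc: md",
    "[DEBUG]   Phrase: And",
    "[DEBUG]   Phrase: bold",
    "[DEBUG]   Phrase: ,",
    "[DEBUG]   Phrase: italics",
    "[DEBUG]   Phrase: , and even",
    "[DEBUG]   Phrase: italics and later",
    "[DEBUG]   Phrase: .",
]

docs = group_phrases(sample)
for name, phrases in docs.items():
    print(name, len(phrases), punctuation_only(phrases))
```

Running the same helper over the output of another parser would give a rough per-document phrase count to compare against.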

lukehsiao commented Feb 10, 2018

Inspecting 5 candidates using the code:

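from fonduer.features import features

# Collect the candidates whose first span starts with 'BC856'
# and whose second span is exactly '150'.
cand = []
for i, c in enumerate(train_cands):
    if c[0].get_span().startswith('BC856') and c[1].get_span() == '150':
        print("###", i)
        cand.append(c)

print("Candidates: {}".format(len(cand)))

# Write every feature of each selected candidate to a log file.
# The file is closed automatically when the block exits.
with open('scapy_log_features.txt', 'w') as log:
    for c in cand:
        log.write("Candidate: {}\n".format(c))
        for f in features.get_all_feats([c]):
            log.write("    Feature: {}\n".format(f))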

at the end of the stg_temp_max tutorial.
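
The log written by that snippet has a two-level layout: a `Candidate:` line followed by indented `Feature:` lines. As a rough sketch, and assuming a second log in the same layout exists (for example from a run with the other parser; the issue itself only shows one file), the two could be compared per candidate as below. The function names and the sample strings are made up for illustration, and the sketch assumes each candidate and feature prints on a single line.

```python
def read_feature_log(lines):
    """Map each 'Candidate: ...' line to the set of 'Feature: ...' lines under it."""
    by_candidate = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("Candidate: "):
            current = by_candidate.setdefault(line[len("Candidate: "):], set())
        elif line.startswith("    Feature: ") and current is not None:
            current.add(line[len("    Feature: "):])
    return by_candidate


def diff_features(log_a, log_b):
    """For candidates present in both logs, return (only_in_a, only_in_b)."""
    result = {}
    for cand in log_a.keys() & log_b.keys():
        only_a = sorted(log_a[cand] - log_b[cand])
        only_b = sorted(log_b[cand] - log_a[cand])
        if only_a or only_b:
            result[cand] = (only_a, only_b)
    return result


# Made-up sample content; real candidate and feature strings will differ.
log_a = read_feature_log([
    "Candidate: cand-1\n",
    "    Feature: feat-x\n",
    "    Feature: feat-y\n",
])
log_b = read_feature_log([
    "Candidate: cand-1\n",
    "    Feature: feat-y\n",
    "    Feature: feat-z\n",
])

print(diff_features(log_a, log_b))
```

With real logs, pass the open file objects (or `f.readlines()`) to `read_feature_log` instead of the inline lists.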

lukehsiao added a commit that referenced this issue Feb 11, 2018
stackoverflowed pushed a commit to stackoverflowed/multimodal that referenced this issue Dec 4, 2021
* Update wording of content

* Remove paleo tutorial

The paleo dataset is too noisy to perform stably without much more data
that would be reasonable to run in this tutorial. Moving it to a
separate branch.