Information_Extraction_from_Biographies

An exploration of NLP methods for information extraction from biographies, with the Extended Taipei Gazetteers.

Proposed NLP Methods Overview
   1. Named Entity Recognition
   2. Relation Extraction
   3. Weighted Cooccurrence Rank
   4. Automatic Timeline Generation
Usage
GitHub Wiki for more

Proposed NLP Methods Overview

We propose and implement several new NLP methods for information extraction.

1. Named Entity Recognition (NER)

Increase recall by combining the output of multiple NER tools with auxiliary information, then increase precision by applying filters and principles.

diagram of proposed NER method

(detail...)
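The recall-then-precision idea can be sketched as follows. This is only a minimal illustration, not the project's implementation: the `Recognizer` type, the function names, the name-length bounds and the `stop_names` filter are all assumptions made for the example, and the real filters and principles are described in the Wiki.

```python
from collections import Counter
from typing import Callable, FrozenSet, Iterable, List, Set

# Hypothetical recognizer interface: takes text, returns candidate names.
Recognizer = Callable[[str], Iterable[str]]


def merge_candidates(text: str, recognizers: List[Recognizer]) -> Counter:
    """Recall step: union the output of every tool, counting one vote per tool."""
    votes: Counter = Counter()
    for recognize in recognizers:
        # set() so that a tool repeating a name still casts a single vote
        for name in set(recognize(text)):
            votes[name] += 1
    return votes


def filter_candidates(
    votes: Counter,
    text: str,
    min_len: int = 2,
    max_len: int = 4,
    stop_names: FrozenSet[str] = frozenset(),
) -> Set[str]:
    """Precision step: drop candidates that fail simple, assumed checks."""
    kept = set()
    for name in votes:
        if not (min_len <= len(name) <= max_len):
            continue  # implausible length for a personal name (assumed bounds)
        if name in stop_names:
            continue  # known non-person terms, e.g. place names
        if name not in text:
            continue  # a tool returned something not literally in the text
        kept.add(name)
    return kept
```

The vote counts returned by `merge_candidates` are kept so that a stricter filter (for example, requiring agreement between at least two tools) could be added without changing the recall step.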

2. Relation Extraction

To support the main relation extraction method, we can also extract relations from grammatical structure, exploiting the fact that the biographee's name is usually omitted in the text.

Take the biography of "王世慶" as an example (assuming the grammatical structure is detected correctly):

table of proposed relation extraction method

(detail...)
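A minimal sketch of the omitted-subject idea, assuming the parser output has already been reduced to (subject, verb, object) clauses. The `Clause` and `Relation` types and the function name are hypothetical and only illustrate the principle: when a clause has no explicit subject, the biographee is taken as the subject.

```python
from typing import List, NamedTuple, Optional, Set


class Clause(NamedTuple):
    subject: Optional[str]  # None when the parser found no explicit subject
    verb: str
    obj: str


class Relation(NamedTuple):
    head: str
    relation: str
    tail: str


def relations_from_clauses(
    biographee: str, clauses: List[Clause], known_names: Set[str]
) -> List[Relation]:
    """Turn parsed clauses into relations, filling omitted subjects."""
    relations = []
    for clause in clauses:
        if clause.obj not in known_names:
            continue  # keep only clauses whose object is a recognized name
        # An omitted subject is assumed to refer to the biographee.
        head = clause.subject if clause.subject is not None else biographee
        relations.append(Relation(head, clause.verb, clause.obj))
    return relations
```

The `known_names` argument would typically be the output of the NER step, so that only clauses pointing at another person produce a relation.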

3. Weighted Cooccurrence Rank

Calculate and rank a cooccurrence score for each pair of names, weighted by the distance between the names, the delimiters between them, and the number of times they cooccur, to find the truly important cooccurrences and relations that were not otherwise found.

weighted cooccurrence

(detail...)
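One way to read the weighting is sketched below. The formula is an assumption chosen for illustration, not the project's actual scoring: each pair of mentions contributes `1 / (1 + distance)`, multiplied by a penalty for every delimiter between them, and contributions are summed so that repeated cooccurrence raises the score. The parameter names and default values are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def find_mentions(text: str, names: Iterable[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, name) for every literal mention, ordered by position."""
    mentions = []
    for name in names:
        for match in re.finditer(re.escape(name), text):
            mentions.append((match.start(), match.end(), name))
    return sorted(mentions)


def cooccurrence_scores(
    text: str,
    names: Iterable[str],
    delimiters: str = ",.;，。；",
    delimiter_penalty: float = 0.5,
    max_distance: int = 50,
) -> Dict[Tuple[str, str], float]:
    """Sum a distance- and delimiter-weighted score over all mention pairs."""
    scores: Dict[Tuple[str, str], float] = defaultdict(float)
    for (_, end1, n1), (start2, _, n2) in combinations(find_mentions(text, names), 2):
        if n1 == n2:
            continue
        distance = max(0, start2 - end1)  # characters between the two mentions
        if distance > max_distance:
            continue
        gap = text[end1:start2]
        crossed = sum(gap.count(d) for d in delimiters)
        weight = (1.0 / (1 + distance)) * (delimiter_penalty ** crossed)
        scores[tuple(sorted((n1, n2)))] += weight  # summing accounts for frequency
    return dict(scores)


def rank(scores: Dict[Tuple[str, str], float]) -> List[Tuple[Tuple[str, str], float]]:
    """Highest-scoring pairs first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Pairs that rank high here but have no extracted relation are candidates for the "unfound relations" mentioned above.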

4. Timeline Generation

Generate a complete timeline using delimiters and a set of principles, or generate a simple timeline using grammatical structure.

timeline generation

(detail...)
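The delimiter-based variant could look roughly like this. It is a simplified sketch under two assumptions that may not match the project: years appear in the form `1945年` (Gregorian, four digits), and the only principle applied is that a clause without a year inherits the most recent year seen before it. The function name and the sample text in the usage note are made up.

```python
import re
from typing import List, Optional, Tuple

# Assumed date format: a four-digit Gregorian year followed by "年".
YEAR = re.compile(r"(\d{4})年")


def simple_timeline(text: str, delimiters: str = "，。；") -> List[Tuple[int, str]]:
    """Split text on delimiters and attach each clause to a year."""
    events: List[Tuple[int, str]] = []
    current_year: Optional[int] = None
    for clause in re.split("[" + re.escape(delimiters) + "]", text):
        clause = clause.strip()
        if not clause:
            continue
        match = YEAR.search(clause)
        if match:
            current_year = int(match.group(1))
        # Assumed principle: a clause with no year inherits the previous one.
        if current_year is not None:
            events.append((current_year, clause))
    # sorted() is stable, so clauses within the same year keep their text order.
    return sorted(events, key=lambda event: event[0])
```

For a made-up input such as `"1945年畢業，任教於某校。1950年赴臺北；1948年結婚。"`, the second clause has no year of its own and is therefore filed under 1945, and the 1948 event is moved ahead of the 1950 one.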

Usage

Prerequisite

  1. Python3 (we develop with Python 3.6)
  2. pip install -r requirements.txt to install all required Python packages
  3. MongoDB
  4. Stanford CoreNLP
    download the main program and unzip it somewhere
    download the Chinese model jar and move it into the Stanford CoreNLP directory you just unzipped.

Execution

  1. Start the MongoDB daemon.
    sudo service mongod start (on Ubuntu)
  2. Start the CoreNLP server.
    in the Stanford CoreNLP directory, run java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 15000
  3. Run the main pipeline and wait several minutes.
    python3 main.py
  4. Results are in ./Database
    some results are also kept in MongoDB. (see Wiki:Data)
    note that the graph result is stored in .graphml format; you can import it into Gephi, Cytoscape, or any other tool that reads GraphML
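
If you only want a quick look at the graph result without opening Gephi or Cytoscape, GraphML is plain XML and can be read with the Python standard library. The sketch below lists node ids and edges. The function names are hypothetical, and the path in the usage note is a placeholder, since the actual file name under ./Database is not stated here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Standard GraphML namespace.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def nodes_from_root(root: ET.Element) -> List[str]:
    """Return the id attribute of every <node> element."""
    return [node.get("id") for node in root.iterfind(".//g:node", GRAPHML_NS)]


def edges_from_root(root: ET.Element) -> List[Tuple[str, str]]:
    """Return (source, target) for every <edge> element."""
    return [
        (edge.get("source"), edge.get("target"))
        for edge in root.iterfind(".//g:edge", GRAPHML_NS)
    ]


def read_graphml(path: str) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Parse a .graphml file and return its node ids and edges."""
    root = ET.parse(path).getroot()
    return nodes_from_root(root), edges_from_root(root)
```

Usage would be along the lines of `nodes, edges = read_graphml("./Database/<your-file>.graphml")`, replacing the placeholder with the file produced by the pipeline.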