Japanese Company Lexicon (JCLdic)

This repository contains the implementation for the paper: High Coverage Lexicon for Japanese Company Name Recognition (ANLP 2020)

Download links

We provide two formats. The CSV format contains one name per line, and the MeCab CSV format contains one record per line; users can open the MeCab CSV file directly to inspect the records. The MeCab Dic format is compiled by MeCab and can be used as a MeCab user dictionary (see MeCab Dic usage).

Our goal is to build an enterprise knowledge graph, so we only consider companies that conduct economic activity for commercial purposes. These companies are denoted as Stock Company (株式会社), Limited Company (有限会社), and Limited Liability Company (合同会社).

The full version contains all kinds of names, including digits, one-character aliases, etc. These abnormal names cause annotation errors in NER tasks. We recommend using the JCL_medium or JCL_slim version.

These release versions are easier to use than the version we used in the paper. Considering the trade-off between dictionary size and search performance, we delete zenkaku (全角, full-width) names and only preserve the hankaku (半角, half-width) names. For example, we delete '株式会社ＫＡＤＯＫＡＷＡ' but preserve '株式会社KADOKAWA'. For the normalization process, please read the Python section of the usage page.
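As a rough illustration of the zenkaku-to-hankaku step, the sketch below uses Unicode NFKC normalization from Python's standard library. This is an assumption for illustration only: the normalization actually used for the release is the one described on the usage page, and the helper name `to_hankaku` is hypothetical.

```python
import unicodedata


def to_hankaku(name: str) -> str:
    """Fold full-width (zenkaku) alphanumerics to half-width (hankaku) via NFKC.

    Assumption: NFKC is used here only as an illustration. It also folds
    other compatibility characters, so it may change more than letters
    and digits.
    """
    return unicodedata.normalize("NFKC", name)


print(to_hankaku("株式会社ＫＡＤＯＫＡＷＡ"))  # 株式会社KADOKAWA
```

Normalizing a query name this way before lookup lets a zenkaku spelling still match the hankaku entry kept in the released dictionary.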

| Single Lexicon | Total Names | Unique Company Names |
| --- | --- | --- |
| JCL-slim | 7067216 | 7067216 |
| JCL-medium | 7555163 | 7555163 |
| JCL-full | 8491326 | 8491326 |
| IPAdic | 392126 | 16596 |
| Juman | 751185 | 9598 |
| NEologd | 3171530 | 244213 |
| **Multiple Lexicon** | | |
| IPAdic-NEologd | 4615340 | 257246 |
| IPAdic-NEologd-JCL(medium) | 12093988 | 7722861 |

Usage

See the wiki page for detailed usage.

JCLdic Generation Process

Instead of downloading the data, you can also build JCLdic from scratch by following the instructions below.

Data Preparation

# conda create -n jcl python=3.6
# source activate jcl
pip install -r requirements.txt

If you want to download the data with Selenium, you need ChromeDriver. First check your Chrome version, then download the corresponding version of ChromeDriver from here.

Unzip the archive to get chromedriver, then move it to the target directory:

cd $HOME/Downloads
unzip chromedriver_mac64.zip 
mv chromedriver /usr/local/bin

We create JCLdic from the original data published on the National Tax Agency Corporate Number Publication Site (国税庁法人番号公表サイト). Please download the ZIP files from that site.

Put the ZIP files in the data/hojin/zip directory, and run the script below to preprocess the data:

bash scripts/download.sh

The directories below are generated automatically, but you must first create the data/hojin/zip directory manually to store the ZIP files.

.
├── data
│   ├── corpora 
│   │   ├── bccwj             # raw dataset
│   │   ├── mainichi          # raw dataset
│   │   └── output            # processed bccwj and mainichi dataset as IOB2 format
│   ├── dictionaries
│   │   ├── ipadic            # raw lexicon
│   │   ├── neologd           # raw lexicon
│   │   ├── juman             # raw lexicon
│   │   └── output            # processed lexicons
│   └── hojin
│       ├── csv               # downloaded hojin data
│       ├── output            # processed JCLdic
│       └── zip               # downloaded hojin data
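
The manual step of creating data/hojin/zip can also be done from Python. The sketch below is only a convenience and is not part of the repository's scripts; the helper name `ensure_zip_dir` is hypothetical.

```python
from pathlib import Path


def ensure_zip_dir(project_root: str = ".") -> Path:
    """Create data/hojin/zip under project_root if it does not exist yet."""
    zip_dir = Path(project_root) / "data" / "hojin" / "zip"
    # parents=True creates data/ and data/hojin/ as needed;
    # exist_ok=True makes the call safe to repeat.
    zip_dir.mkdir(parents=True, exist_ok=True)
    return zip_dir
```

Calling `ensure_zip_dir()` from the repository root returns the path where the downloaded ZIP files should be placed.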

JCLdic Generation

Generating alias

bash scripts/generate_alias.sh

At this point, JCLdic is ready.

If you want to get the MeCab format:

python tools/save_mecab_format.py

Evaluation

The results below are based on the latest version of JCLdic and may differ from those reported in the paper.

Datasets, dictionaries, and annotated datasets preparation

Because these datasets (Mainichi, BCCWJ) are not free, you need to obtain them yourself. Once you have them, put them in data/corpora/{bccwj,mainichi} and run the commands below:

# 1 Datasets preparation
python tools/dataset_converter.py # Read data from .xml, .sgml to .tsv
python tools/dataset_preprocess.py # Generate .bio data

If you want to compare against other dictionaries, download them from the links below and put them in data/dictionaries/{ipadic,juman,neologd}:

# ipadic
# https://github.com/taku910/mecab/tree/master/mecab-ipadic

# juman
# https://github.com/taku910/mecab/tree/master/mecab-jumandic

# neologd
# https://github.com/neologd/mecab-ipadic-neologd/blob/master/seed/mecab-user-dict-seed.20200109.csv.xz

# 2 Prepare dictionaries 
python tools/dictionary_preprocess.py
# 3 Annotate datasets with different dictionaries 
python tools/annotation_with_dict.py

Intrinsic Evaluation: Coverage

Calculate coverage:

python tools/coverage.py

The intrinsic evaluation counts how many company names from each dataset appear in each lexicon. The best results are highlighted.

| Single Lexicon | Mainichi Count | Mainichi Coverage | BCCWJ Count | BCCWJ Coverage |
| --- | --- | --- | --- | --- |
| JCL-slim | 727 | 0.4601 | 419 | 0.4671 |
| JCL-medium | 730 | 0.4620 | 422 | 0.4705 |
| JCL-full | 805 | 0.5095 | 487 | 0.5429 |
| IPAdic | 726 | 0.4595 | 316 | 0.3523 |
| Juman | 197 | 0.1247 | 133 | 0.1483 |
| NEologd | 424 | 0.2684 | 241 | 0.2687 |
| **Multiple Lexicon** | | | | |
| IPAdic-NEologd | 839 | 0.5310 | 421 | 0.4693 |
| IPAdic-NEologd-JCL(medium) | 1064 | 0.6734 | 568 | 0.6332 |
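
The reported coverage values appear consistent with dividing the count by the number of unique company entities in each dataset (for example, 727 / 1580 ≈ 0.4601 for JCL-slim on Mainichi). The sketch below computes coverage under that assumption; it is not the implementation in tools/coverage.py, and the function name and toy data are hypothetical.

```python
from typing import Iterable, Set, Tuple


def coverage(gold_names: Iterable[str], lexicon: Set[str]) -> Tuple[int, float]:
    """Return (count, coverage) of unique gold company names found in the lexicon."""
    unique_gold = set(gold_names)
    if not unique_gold:
        return 0, 0.0
    count = sum(1 for name in unique_gold if name in lexicon)
    return count, count / len(unique_gold)


# Toy data (hypothetical, not taken from the corpora).
gold = ["トヨタ自動車", "ソニー", "任天堂", "ソニー"]
lexicon = {"トヨタ自動車", "任天堂", "楽天"}
count, ratio = coverage(gold, lexicon)
print(count, round(ratio, 4))  # 2 0.6667
```

Duplicates in the gold list are collapsed first, so a frequently mentioned company counts once.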

Extrinsic Evaluation: NER task

Make sure main.py has the following settings:

# main.py setting
entity_level = False 
# ...
### result 1 ###
# bccwj  
main(bccwj_paths, bccwj_glod, entity_level=entity_level)
# mainichi
main(mainichi_paths, mainichi_glod, entity_level=entity_level)

Run the command below:

python main.py

The extrinsic evaluation uses the NER task to measure the performance of each lexicon. We annotate the training set with each lexicon, train the models (CRF and Bi-LSTM-CRF), and evaluate on the test set. Gold means the model is trained with the true labels.

The following table shows the extrinsic evaluation results. The best results are highlighted.

| Single Lexicon | Mainichi F1 (CRF) | Mainichi F1 (Bi-LSTM-CRF) | BCCWJ F1 (CRF) | BCCWJ F1 (Bi-LSTM-CRF) |
| --- | --- | --- | --- | --- |
| Gold | 0.9756 | 0.9683 | 0.9273 | 0.8911 |
| JCL-slim | 0.8533 | 0.8708 | 0.8506 | 0.8484 |
| JCL-medium | 0.8517 | 0.8709 | 0.8501 | 0.8526 |
| JCL-full | 0.5264 | 0.5792 | 0.5646 | 0.7028 |
| Juman | 0.8865 | 0.8905 | 0.8320 | 0.8169 |
| IPAdic | 0.9048 | 0.9141 | 0.8646 | 0.8334 |
| NEologd | 0.8975 | 0.9066 | 0.8453 | 0.8288 |
| **Multiple Lexicon** | | | | |
| IPAdic-NEologd | 0.8911 | 0.9074 | 0.8624 | 0.8360 |
| IPAdic-NEologd-JCL(medium) | 0.8335 | 0.8752 | 0.8530 | 0.8524 |
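
Annotating a training set with a lexicon can be sketched as a greedy longest-match lookup that emits IOB2 tags. This is a minimal illustration under stated assumptions, not the logic of tools/annotation_with_dict.py: the function name, the `COMPANY` label, the `max_len` window, and the toy sentence are all hypothetical.

```python
from typing import List, Set


def annotate_iob2(tokens: List[str], lexicon: Set[str], max_len: int = 5,
                  label: str = "COMPANY") -> List[str]:
    """Tag tokens with IOB2 labels using greedy longest-match lookup."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if "".join(tokens[i:i + n]) in lexicon:
                matched = n
                break
        if matched:
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + matched):
                tags[j] = f"I-{label}"
            i += matched
        else:
            i += 1
    return tags


tokens = ["トヨタ", "自動車", "は", "ソニー", "と", "提携", "した"]
lexicon = {"トヨタ", "トヨタ自動車", "ソニー"}
print(annotate_iob2(tokens, lexicon))
# ['B-COMPANY', 'I-COMPANY', 'O', 'B-COMPANY', 'O', 'O', 'O']
```

With this kind of matching, any short or numeric lexicon entry that happens to occur in ordinary text is also tagged as a company, which is one way noisy names can hurt the labels.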

Extra Experiment

Dictionary annotation as feature on token level

The new experiment results are shown in parentheses. We use the dictionary annotation as a CRF feature, and the best results are highlighted. The results show that the dictionary feature boosts performance for every lexicon, most of all for JCL-full (0.5264 to 0.9764 on Mainichi).

| Single Lexicon | Mainichi F1 (CRF) | BCCWJ F1 (CRF) |
| --- | --- | --- |
| Gold | 0.9756 (1) | 0.9273 (1) |
| JCL-slim | 0.8533 (0.9754) | 0.8506 (0.9339) |
| JCL-medium | 0.8517 (0.9752) | 0.8501 (0.9303) |
| JCL-full | 0.5264 (0.9764) | 0.5646 (0.9364) |
| Juman | 0.8865 (0.9754) | 0.8320 (0.9276) |
| IPAdic | 0.9048 (0.9758) | 0.8646 (0.9299) |
| NEologd | 0.8975 (0.9750) | 0.8453 (0.9282) |
| **Multiple Lexicon** | | |
| IPAdic-NEologd | 0.8911 (0.9767) | 0.8624 (0.9366) |
| IPAdic-NEologd-JCL(medium) | 0.8335 (0.9759) | 0.8530 (0.9334) |
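
Using the dictionary annotation as a CRF feature can be sketched as adding the dictionary tag of each token, and of its neighbors, to the per-token feature dict. The feature names, function names, and toy sentence below are assumptions for illustration; the actual feature set used in main.py may differ. The list-of-dicts shape is the one accepted by CRF libraries such as sklearn-crfsuite.

```python
from typing import Dict, List


def token_features(tokens: List[str], dict_tags: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for token i, including its dictionary tag."""
    feats: Dict[str, object] = {
        "bias": 1.0,
        "token": tokens[i],
        "dict_tag": dict_tags[i],  # the dictionary annotation used as a feature
    }
    if i > 0:
        feats["-1:token"] = tokens[i - 1]
        feats["-1:dict_tag"] = dict_tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["+1:token"] = tokens[i + 1]
        feats["+1:dict_tag"] = dict_tags[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens: List[str], dict_tags: List[str]) -> List[Dict[str, object]]:
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, dict_tags, i) for i in range(len(tokens))]


tokens = ["ソニー", "が", "発表"]
dict_tags = ["B-COMPANY", "O", "O"]
features = sentence_features(tokens, dict_tags)
print(features[0]["dict_tag"], features[1]["-1:dict_tag"])  # B-COMPANY B-COMPANY
```

In this setup the model is still trained on the true labels, so a noisy dictionary tag is only a hint that the CRF can learn to weight or ignore.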

Dictionary annotation as feature on entity level

Make sure main.py has the following settings:

# main.py setting
entity_level = True 
# ...
### result 1 ###
# bccwj  
main(bccwj_paths, bccwj_glod, entity_level=entity_level)
# mainichi
main(mainichi_paths, mainichi_glod, entity_level=entity_level)

### result 2 ###
# bccwj: use dictionary as feature for CRF
crf_tagged_pipeline(bccwj_paths, bccwj_glod, entity_level=entity_level)
# mainichi: use dictionary as feature for CRF       
crf_tagged_pipeline(mainichi_paths, mainichi_glod, entity_level=entity_level) 

Run the command below:

python main.py

The entity level result:

  • result1: train the model on labels tagged by the dictionary
  • result2: add the dictionary tag as a CRF feature and train on the true labels

| Single Lexicon | Mainichi F1 (CRF) Result1 | Mainichi F1 (CRF) Result2 | BCCWJ F1 (CRF) Result1 | BCCWJ F1 (CRF) Result2 |
| --- | --- | --- | --- | --- |
| Gold | 0.7826 | | 0.5537 | |
| JCL-slim | 0.1326 | 0.7969 | 0.1632 | 0.5892 |
| JCL-medium | 0.1363 | 0.7927 | 0.1672 | 0.5813 |
| JCL-full | 0.0268 | 0.8039 | 0.0446 | 0.6205 |
| Juman | 0.0742 | 0.7923 | 0.0329 | 0.5661 |
| IPAdic | 0.3099 | 0.7924 | 0.1605 | 0.5961 |
| NEologd | 0.1107 | 0.7897 | 0.0814 | 0.5718 |
| **Multiple Lexicon** | | | | |
| IPAdic-NEologd | 0.2456 | 0.7986 | 0.1412 | 0.6187 |
| IPAdic-NEologd-JCL(medium) | 0.1967 | 0.8009 | 0.2166 | 0.6132 |

From result1 and result2, we can see that these dictionaries are not suitable for annotating training labels, but the dictionary feature does improve performance in result2.
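
Entity-level scoring can be sketched as exact-match F1 over spans extracted from IOB2 tag sequences. This is an assumption about the metric for illustration only; the scoring code in the repository may differ in details such as how malformed tag sequences are handled. All names and the toy data are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one IOB2 tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact-match entity spans."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["B-COMPANY", "I-COMPANY", "O", "B-COMPANY"]]
pred = [["B-COMPANY", "O", "O", "B-COMPANY"]]
print(round(entity_f1(gold, pred), 4))  # 0.5
```

Because a span only counts when both boundaries match, a prediction that covers part of a company name scores as a miss, which helps explain why entity-level scores are much lower than token-level ones.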

Dictionary feature for low frequency company names on entity level

We first divide the result into 3 categories:

| Category | Description | Evaluation |
| --- | --- | --- |
| Zero | The entity does not appear in the training set | Zero-shot: performance on unseen entities |
| One | The entity appears only once in the training set | One-shot: performance on low-frequency entities |
| More | The entity appears many times in the training set | Training on normal data |
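
The three categories can be assigned by counting how often each entity occurs in the training set. The sketch below assumes that More means two or more occurrences; the function name and toy data are hypothetical and not taken from the repository.

```python
from collections import Counter
from typing import Dict, Iterable


def categorize_entities(train_entities: Iterable[str],
                        test_entities: Iterable[str]) -> Dict[str, str]:
    """Map each unique test entity to Zero, One, or More by training frequency."""
    train_counts = Counter(train_entities)
    categories = {}
    for entity in set(test_entities):
        n = train_counts[entity]  # Counter returns 0 for unseen entities
        categories[entity] = "Zero" if n == 0 else "One" if n == 1 else "More"
    return categories


train = ["ソニー", "ソニー", "任天堂"]
test = ["ソニー", "任天堂", "楽天"]
print(categorize_entities(train, test))
```

Here ソニー falls in More, 任天堂 in One, and 楽天 in Zero.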

The dataset statistics:

| Dataset | BCCWJ | Mainichi |
| --- | --- | --- |
| Company Samples/Sentence | 1364 | 3027 |
| Company Entities | 1704 | 4664 |
| Unique Company Entities | 897 | 1580 |
| Number of Unique Company Entities Existing in Training Set | Zero: 226, One: 472, More: 199 | Zero: 1440, One: 49, More: 91 |

The experiment results:

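| Single Lexicon | BCCWJ F1 (CRF): Zero | BCCWJ F1 (CRF): One | BCCWJ F1 (CRF): More | Mainichi F1 (CRF): Zero | Mainichi F1 (CRF): One | Mainichi F1 (CRF): More |
| --- | --- | --- | --- | --- | --- | --- |
| Gold | 0.4080 | 0.8211 | 0.9091 | 0.4970 | 0.8284 | 0.9353 |
| JCL-slim | 0.4748 | 0.8333 | 0.9091 | 0.5345 | 0.8075 | 0.9509 |
| JCL-medium | 0.4530 | 0.8660 | 0.9091 | 0.5151 | 0.8061 | 0.9503 |
| JCL-full | 0.5411 | 0.8333 | 0.8933 | 0.5630 | 0.8467 | 0.9476 |
| Juman | 0.4506 | 0.7957 | 0.9032 | 0.5113 | 0.8655 | 0.9431 |
| IPAdic | 0.4926 | 0.8421 | 0.9161 | 0.5369 | 0.8633 | 0.9419 |
| NEologd | 0.4382 | 0.8454 | 0.9161 | 0.5343 | 0.8456 | 0.9359 |
| **Multiple Lexicon** | | | | | | |
| IPAdic-NEologd | 0.5276 | 0.8600 | 0.9091 | 0.5556 | 0.8623 | 0.9432 |
| IPAdic-NEologd-JCL(medium) | 0.5198 | 0.8421 | 0.8947 | 0.5484 | 0.8487 | 0.9476 |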

From the results above, we can see that JCLdic improves zero-shot and one-shot performance, especially on the BCCWJ dataset, where zero-shot F1 rises from 0.4080 (Gold) to 0.5411 (JCL-full).

Citation

Please use the following BibTeX entry when you cite JCLdic in your papers.

@INPROCEEDINGS{liang2020jcldic,
    author    = {Xu Liang and Taniguchi Yasufumi and Nakayama Hiroki},
    title     = {High Coverage Lexicon for Japanese Company Name Recognition},
    booktitle = {Proceedings of the Twenty-sixth Annual Meeting of the Association for Natural Language Processing},
    year      = {2020},
    pages     = {NLP2020-B2-3},
    publisher = {The Association for Natural Language Processing},
}