
Commit

Update for lbag.
zzsfornlp committed Sep 20, 2020
1 parent e446114 commit c58ccd9
Showing 42 changed files with 5,788 additions and 1 deletion.
2 changes: 2 additions & 0 deletions README.md
@@ -12,6 +12,8 @@ How to configurate, generally: [here](docs/conf.md)

Related works:

"An Empirical Exploration of Local Ordering Pre-training for Structured Prediction": [TODO](??)

"A Two-Step Approach for Implicit Event Argument Detection": [details](docs/iarg.md)

Some other parsers for interested readers: [details](docs/sop.md)
68 changes: 68 additions & 0 deletions docs/lbag.md
@@ -0,0 +1,68 @@
### For the Local Bag pre-training method

Hi, this describes the implementation for our work: "An Empirical Exploration of Local Ordering Pre-training for Structured Prediction".

Please refer to the paper for more details: [[paper]](TODO) [[bib]](TODO)

### Repo

When carrying out the experiments for this work, we used this repo at the commit [`here`](TODO). Later versions of this repo may contain slight changes (for example, changed default hyper-parameters or renamed hyper-parameters).

### Environment

Same as those of the main `msp` package:

python>=3.6
dependencies: pytorch>=1.0.0, numpy, scipy, gensim, cython, transformers, ...
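
For reference, a minimal environment setup sketch (the pinned versions and the env name `lbag` are only illustrative assumptions; any python>=3.6 environment with the listed dependencies should work):

    # illustrative setup; versions are assumptions, not the exact ones we used
    conda create -n lbag python=3.6 -y && conda activate lbag
    pip install "torch>=1.0.0" numpy scipy gensim cython transformers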

### Data

- Pre-training data: any large corpus can be used; we use a random subset of Wikipedia. The format is simply one sentence per line, but the text **needs to be tokenized (tokens separated by spaces)!** (See the small example after this list.)
- Task data: the dependency parsing data are in CoNLL-U format and are available from the official UD website; the NER data should be in the same format as CoNLL03.
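
For illustration, a pre-training text file simply looks like the following (made-up sentences; one pre-tokenized sentence per line, tokens separated by spaces):

    The quick brown fox jumps over the lazy dog .
    Each line of the pre-training corpus is one tokenized sentence like this .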

### Running

- Step 0: Setup

Assume we are at a new DIR. Please download this repo into a DIR called `src` (`git clone https://github.com/zzsfornlp/zmsp src`) and specify some ENV variables for convenience (a concrete sketch follows this list):

SRC_DIR: Root dir of this repo
CUR_LANG: Lang id of the current language (for example en)
WIKI_PRETRAIN_SIZE: Pretraining size
UD_TRAIN_SIZE: Task training size
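
For example (the values below are only illustrative assumptions; the two size values must match the suffixes used in your data file names, see Step 1):

    export SRC_DIR=$(pwd)/src       # root dir of this repo
    export CUR_LANG=en              # lang id of the current language
    export WIKI_PRETRAIN_SIZE=1m    # illustrative value; must match the wiki file name suffix
    export UD_TRAIN_SIZE=full       # illustrative value; must match the UD train file name suffix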

- Step 1: Build dictionary with pre-training data

Before this, we should have the data prepared (for pre-training and task-training).

Assume that we have UD files at `data/UD_RUN/ud24s/${CUR_LANG}_train.${UD_TRAIN_SIZE}.conllu`, and pre-training (wiki) files at `data/UD_RUN/wikis/wiki_${CUR_LANG}.${WIKI_PRETRAIN_SIZE}.txt`.
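
That is, the assumed data layout (relative to the new root DIR) looks roughly like this:

    data/UD_RUN/
        ud24s/${CUR_LANG}_train.${UD_TRAIN_SIZE}.conllu     # UD (task) training data
        wikis/wiki_${CUR_LANG}.${WIKI_PRETRAIN_SIZE}.txt    # pre-training (wiki) data
        vocabs/voc_${CUR_LANG}/                             # vocabularies built in this step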

Assuming now we are at DIR `data/UD_RUN/vocabs/voc_${CUR_LANG}`, we first create the vocabulary for this setting with:

PYTHONPATH=${SRC_DIR} python3 ${SRC_DIR}/tasks/cmd.py zmlm.main.vocab_utils train:../../wikis/wiki_${CUR_LANG}.${WIKI_PRETRAIN_SIZE}.txt input_format:plain norm_digit:1 >vv.list

The `vv_*` files in this dir will be the vocabularies used in the remaining steps.

- Step 2: Do pre-training

Assuming now we are at DIR `data/..`

Simply use the script of `${SRC_DIR}/scripts/lbag/run.py` for pre-training.

python3 ${SRC_DIR}/scripts/lbag/run.py -l ${CUR_LANG} --rgpu 0 --run_dir run_orp_${CUR_LANG} --enc_type trans --run_mode pre --pre_mode orp --train_size ${WIKI_PRETRAIN_SIZE} --do_test 0

Note that by default the data dirs are pre-set to the ones from Step 1; the paths can also be specified explicitly, please refer to the script for more details.

There are various modes for pre-training; the most typical ones are `orp` (or lbag, our local reordering strategy), `mlm` (masked LM) and `om` (orp+mlm). Please use `--pre_mode` to specify one, as in the examples below.
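
For instance, the following variants pre-train with the other objectives (only `--pre_mode` is changed from the command above; the `--run_dir` names here are just illustrative):

    # masked LM only
    python3 ${SRC_DIR}/scripts/lbag/run.py -l ${CUR_LANG} --rgpu 0 --run_dir run_mlm_${CUR_LANG} --enc_type trans --run_mode pre --pre_mode mlm --train_size ${WIKI_PRETRAIN_SIZE} --do_test 0
    # orp + mlm combined
    python3 ${SRC_DIR}/scripts/lbag/run.py -l ${CUR_LANG} --rgpu 0 --run_dir run_om_${CUR_LANG} --enc_type trans --run_mode pre --pre_mode om --train_size ${WIKI_PRETRAIN_SIZE} --do_test 0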

This may take a while (it took us three days to pretrain with 1M data on a single GPU). After this, we get the pre-trained models at `run_orp_${CUR_LANG}`.

- Step 3: Fine-tuning on specific tasks

Finally, we train (fine-tune) on specific tasks (here on Dep+Pos with UD data) with the pre-trained model. We can still use the script `${SRC_DIR}/scripts/lbag/run.py`; simply change `--run_mode` to `ppp1`, together with the other relevant options.

python3 ${SRC_DIR}/scripts/lbag/run.py -l ${CUR_LANG} --rgpu 0 --cur_run 1 --run_dir run_ppp1_${CUR_LANG} --run_mode ppp1 --train_size ${UD_TRAIN_SIZE} --preload_prefix ../run_orp_${CUR_LANG}/zmodel.c200

Here, we use the `checkpoint@200` model from the pre-training dir; other checkpoints can also be specified (providing an unambiguous model prefix is enough), as in the sketch below.
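
For example, assuming another checkpoint such as `zmodel.c100` exists in the pre-training dir (the checkpoint number here is a hypothetical assumption), it could be selected like this:

    # hypothetical alternative: point --preload_prefix at another unambiguous checkpoint prefix
    python3 ${SRC_DIR}/scripts/lbag/run.py -l ${CUR_LANG} --rgpu 0 --cur_run 1 --run_dir run_ppp1_${CUR_LANG} --run_mode ppp1 --train_size ${UD_TRAIN_SIZE} --preload_prefix ../run_orp_${CUR_LANG}/zmodel.c100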

Again, the paths default to the ones set up in Step 1; if using other paths, they can also be specified with the various `--*_dir` options.
2 changes: 1 addition & 1 deletion msp/zext/annotators/stanford.py
@@ -38,7 +38,7 @@ def zwarn(s):
zlog("!!"+str(s))

# =====
LANG_NAME_MAP = {'es': 'spanish', 'zh': 'chinese', 'en': 'english'}
LANG_NAME_MAP = {'es': 'spanish', 'zh': 'chinese', 'en': 'english', 'ru': 'russian', 'uk': 'ukraine'}

# map CTB pos tags to UD (partially reference "Developing Universal Dependencies for Mandarin Chinese")
CORENLP_POS_TAGS = [
