```shell
cd pretrain/chinese_process/
python collect.py
```

```shell
cd pretrain/english_process/
python collect.py
```
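The internals of the `collect.py` scripts are not shown above. As a hedged illustration of the general pattern — gathering raw documents into a single JSON-lines file — here is a minimal hypothetical sketch; the folder name, file names, and record schema are assumptions, not the repository's actual layout:

```python
import json
import pathlib

# Hypothetical sketch of a corpus-collection step; the repository's actual
# collect.py scripts will differ in schema and file layout.
corpus_dir = pathlib.Path("corpus")  # assumed input folder
corpus_dir.mkdir(exist_ok=True)
(corpus_dir / "doc1.txt").write_text("hello world", encoding="utf-8")  # stand-in document

with open("collected.jsonl", "w", encoding="utf-8") as out:
    for path in sorted(corpus_dir.glob("*.txt")):
        record = {"source": path.name, "text": path.read_text(encoding="utf-8")}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```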
Note that the `redpajama_train.json` is obtained by running the following command:

```shell
cd pretrain
python download_from_hf.py
```
First of all, get into the `train` folder, then combine the Chinese and English datasets to produce the `train.txt` file:

```shell
./combine_chinese_corpus.sh
```
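The combine script itself is not reproduced here; conceptually it concatenates the two corpora into one `train.txt`. A minimal Python sketch of that assumed behavior (the input file names are stand-ins, not the repository's actual files):

```python
# Minimal sketch of what combine_chinese_corpus.sh is assumed to do:
# concatenate the Chinese and English corpora into a single train.txt.
with open("chinese.txt", "w", encoding="utf-8") as f:
    f.write("这是一个中文句子。\n")
with open("english.txt", "w", encoding="utf-8") as f:
    f.write("This is an English sentence.\n")

with open("train.txt", "w", encoding="utf-8") as out:
    for name in ("chinese.txt", "english.txt"):  # stand-in corpus files
        with open(name, encoding="utf-8") as src:
            out.write(src.read())
```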
Then, split the `train.txt` into 8 subfiles (each for one GPU process to load):

```shell
# shuffle and split
./split.sh
```
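The shuffle-and-split step can be sketched in Python as below. This is a hedged illustration of the assumed behavior (shuffle the lines, then deal them into 8 line-balanced shards, one per GPU process); the actual `split.sh` may name its outputs differently:

```python
import random

# Sketch of the assumed behavior of split.sh: shuffle the training lines,
# then deal them into 8 line-balanced subfiles, one per GPU process.
random.seed(0)  # deterministic only for illustration

lines = [f"line {i}\n" for i in range(80)]  # stand-in for train.txt contents
random.shuffle(lines)

num_shards = 8
for idx in range(num_shards):
    with open(f"train_{idx:02d}.txt", "w", encoding="utf-8") as f:
        f.writelines(lines[idx::num_shards])  # every 8th line, offset by idx
```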
Download the QASPER-v0.3 dataset from the path listed in `../README.md` and run:

```shell
python collect.py
```
Simply download the scientific emotional dialogue dataset and put it under the `./data/sft/emotional` folder.
Download the Dolly corpus by running:

```shell
cd data/sft/dolly
python download_from_hf.py
```
Download the SciMRC dataset via the link listed in `../README`, and process it by running the following command:

```shell
python collect.py
```
Combine the Dolly, SciMRC, and QASPER instruction datasets for supervised fine-tuning (the emotional dialogue dataset is excluded):

```shell
python combine.py
```
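The combining step can be sketched as follows. This is a hypothetical illustration only: the sample records, output file name, and schema are assumptions, and the real `combine.py` may differ in all of them.

```python
import json

# Hypothetical sketch of merging the three instruction datasets into one
# SFT file; the real combine.py's schema and file names may differ.
dolly = [{"instruction": "Say hello.", "response": "Hello!"}]
scimrc = [{"instruction": "What does the paper study?", "response": "Reading comprehension."}]
qasper = [{"instruction": "Which dataset is used?", "response": "QASPER."}]

combined = dolly + scimrc + qasper  # emotional dialogue data intentionally excluded
with open("sft_train.json", "w", encoding="utf-8") as f:
    json.dump(combined, f, ensure_ascii=False, indent=2)
```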