Training on the OpenThaiGPT Dataset

Training Steps

1. Install all requirements listed in PRETRAIN.md.
2. Run submit_data_hf.sh to convert the Hugging Face datasets to JSONL format:
   2.1 Convert The Pile (Hugging Face) to The Pile JSONL.
   2.2 Convert OpenThaiGPT (Hugging Face) to OpenThaiGPT JSONL.
3. Run submit_data_openthai.sh to tokenize the OpenThaiGPT JSONL data and save it in TinyLlama format.
4. Run submit_data_thepile.sh to tokenize The Pile JSONL data (converted from Hugging Face in step 2.1) and save it in TinyLlama format.
5. Run submit_train.sh to train the model.
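
The JSONL conversion in step 2 can be pictured with a minimal sketch like the one below. It is not the code behind submit_data_hf.sh, which is not shown here. The function name write_jsonl, the output path, and the "text" field are assumptions made for illustration. Any iterable of dict-like records (for example, rows of a loaded Hugging Face dataset) could be passed in.

```python
import json
from pathlib import Path
from typing import Iterable, Mapping


def write_jsonl(records: Iterable[Mapping], out_path: str) -> int:
    """Write one JSON object per line and return the number of records written.

    Hypothetical helper for illustration; not part of this repository.
    """
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Thai text as UTF-8 instead of \u escapes
            f.write(json.dumps(dict(record), ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Assumed record layout: a single "text" field per document.
    sample = [{"text": "สวัสดี"}, {"text": "hello"}]
    print(write_jsonl(sample, "data/sample.jsonl"))
```

Writing line by line keeps memory use flat for large corpora, since no more than one record is serialized at a time.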

After training, convert the saved checkpoint (here, iter-200000-ckpt.pth):

python scripts/convert_lit_checkpoint.py \
    --checkpoint_name iter-200000-ckpt.pth \
    --out_dir out/tinyllama_1b \
    --model_name tiny_LLaMA_1b

Then upload the converted model to the Hub:

python scripts/push_to_hub.py