SoICT Hackathon 2023 Track NLU Solutions

1. Inference

* Note: Checkpoint links

ASR: https://huggingface.co/thanhduycao/wav2vec2-large-finetune-aug-on-fly-synthesis-fix-60-epoch-v2/ (the checkpoint revision code is set in predict.sh)
Spoken-norm: https://huggingface.co/linhtran92/finetuned_taggenv2_55epoch_encoder_embeddings
NLU: the file JointIDSF_PhoBERTencoder.zip in the folder training/soict_hackathon_JointIDSF/

Run all using .sh file

Download the JointIDSF model and move it to the folder training/soict_hackathon_JointIDSF/
Link to the model zip: https://drive.google.com/drive/folders/1SXvzXiHb-0OI4c7PfYpfmxO_oVQxO-s-?usp=sharing
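
Before running the scripts, it can help to confirm that the archive is where the NLU step expects it. The following is a minimal sketch: the function name `check_model_zip` is hypothetical and not part of the repository, and the default path is an assumption that simply combines the folder and file name given above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): verify that the
# JointIDSF archive is present before starting inference.
check_model_zip() {
    # $1: path to the archive; the default is an assumed location built
    # from the folder and file name mentioned in this README.
    zip_path="${1:-training/soict_hackathon_JointIDSF/JointIDSF_PhoBERTencoder.zip}"
    if [ -f "$zip_path" ]; then
        echo "found: $zip_path"
        return 0
    fi
    echo "missing: $zip_path" >&2
    return 1
}
```

One possible use is to gate the inference script on the check, for example `check_model_zip && scripts/predict.sh`.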

# Set up the requirements
chmod +x scripts/run_commands.sh
scripts/run_commands.sh

chmod +x scripts/predict.sh
scripts/predict.sh

The results are written to the folder training/soict_hackathon_JointIDSF/ in the file "predictions.jsonl".
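
A quick sanity check of the output file might look like the sketch below. It only assumes that the file holds one JSON object per line, as the .jsonl extension suggests; the record schema is not described here, so nothing is parsed. The function name `summarize_predictions` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): report how many
# records a JSONL predictions file holds and print the first one.
summarize_predictions() {
    # $1: path to the JSONL file; defaults to the location named above.
    pred_file="${1:-training/soict_hackathon_JointIDSF/predictions.jsonl}"
    if [ ! -f "$pred_file" ]; then
        echo "no predictions file at $pred_file" >&2
        return 1
    fi
    # Count non-empty lines; each one is treated as a prediction record.
    count=$(grep -c . "$pred_file" || true)
    echo "records: $count"
    # Show the first record unmodified for a visual check.
    head -n 1 "$pred_file"
}
```

Called without arguments from the repository root, it inspects the default output path.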

2. Training

2.1 Train ASR

More detailed training instructions are in the README.md of the training/ASR-Wav2vec-Finetune folder.

cd training/ASR-Wav2vec-Finetune
chmod +x asr_train.sh
./asr_train.sh
cd ../..
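
The cd / chmod / run / cd-back pattern used in the training steps can also be wrapped in a subshell, so the working directory of the calling shell never changes. This is only a sketch; `run_in_dir` is a hypothetical name and not something the repository provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): run a script from
# inside its own folder without changing the caller's working directory.
run_in_dir() {
    # $1: folder to enter, $2: script name inside that folder.
    (
        # The parentheses start a subshell, so this cd is local to it.
        cd "$1" || exit 1
        chmod +x "$2" || exit 1
        "./$2"
    )
}
```

For example, `run_in_dir training/ASR-Wav2vec-Finetune asr_train.sh` would be roughly equivalent to the four commands above.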

2.2 Train spoken-norm

More detailed training instructions are in the README.md of the training/norm-tuned folder.

cd training/norm-tuned
chmod +x norm_train.sh
./norm_train.sh
cd ../..

2.3 Train NLU

More detailed training instructions are in the README.md of the training/soict_hackathon_JointIDSF folder.

cd training
chmod 755 -R soict_hackathon_JointIDSF
cd soict_hackathon_JointIDSF
# Important: before running nlu_train.sh, delete the models/ and
# data_aug_full_0919_22/ folders if they exist.
rm -rf models/
rm -rf data_aug_full_0919_22/
chmod +x nlu_train.sh
./nlu_train.sh
cd ../..
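
The cleanup step before NLU training can be made more explicit by removing the leftover folders only when they exist and reporting what was deleted. The sketch below assumes nothing beyond the two folder names given above; `clean_stale_dirs` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): delete leftover
# output folders inside a base folder and report each removal.
clean_stale_dirs() {
    # $1: base folder; remaining arguments: folder names to delete in it.
    base="$1"
    shift
    for name in "$@"; do
        if [ -d "$base/$name" ]; then
            # ${base:?} aborts if base is empty, so "/$name" is never removed.
            rm -rf "${base:?}/$name"
            echo "removed $base/$name"
        fi
    done
}
```

From the repository root this could be called as `clean_stale_dirs training/soict_hackathon_JointIDSF models data_aug_full_0919_22` before starting nlu_train.sh.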

3. Synthesis data

3.1 Installation

cd synthesis-data-for-ASR
pip install -r requirements.txt

3.2 Create data

CUDA_VISIBLE_DEVICES=0 python create_transcription_wer.py --data_links="thanhduycao/soict_train_dataset" --output_path="thanhduycao/soict_train_dataset_with_wer_validate" --token="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" --num_workers=2

CUDA_VISIBLE_DEVICES=0 python lyric-alignment/predict.py --data_links="thanhduycao/soict_train_dataset_with_wer_validate" --output_path="thanhduycao/data_for_synthesis_with_entities_align_v5_validate" --token="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" --num_workers=4

CUDA_VISIBLE_DEVICES=0 python create_entity_dataset.py --data_links="thanhduycao/data_for_synthesis_with_entities_align_v5_validate" --output_path="thanhduycao/data_for_synthesis_entities_validate" --token="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" --num_workers=1

CUDA_VISIBLE_DEVICES=0 python create_synthesis_dataset.py --data_links="thanhduycao/data_for_synthesis_with_entities_align_v5_validate" --output_path="thanhduycao/data_synthesis_validate" --token="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" --num_workers=1
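
The four commands above share the same flags and pass the Hugging Face token as a literal. A variant that reads the token from the environment keeps it out of shell history and shared files. The wrapper below is a sketch under stated assumptions: `run_step`, the `HF_TOKEN` variable name as used here, and the `DRY_RUN` switch are all hypothetical conventions introduced for illustration; only the script flags come from the commands above.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the repository): run one data-creation
# step, taking the token from the HF_TOKEN environment variable.
run_step() {
    # $1: script path, $2: input dataset, $3: output dataset, $4: workers.
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set" >&2
        return 1
    fi
    # DRY_RUN=1 prints the command with the token masked instead of running it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "python $1 --data_links=$2 --output_path=$3 --token=*** --num_workers=$4"
        return 0
    fi
    CUDA_VISIBLE_DEVICES=0 python "$1" --data_links="$2" --output_path="$3" \
        --token="$HF_TOKEN" --num_workers="$4"
}
```

With the token exported once (`export HF_TOKEN=...`), the first step would become `run_step create_transcription_wer.py thanhduycao/soict_train_dataset thanhduycao/soict_train_dataset_with_wer_validate 2`, and the remaining steps follow the same shape.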
