MlingConf: A Comprehensive Investigation of Multilingual Confidence Estimation for Large Language Models
[📄 Paper Link]
This project, MlingConf, introduces a comprehensive investigation of multilingual confidence estimation for LLMs, focusing on both language-agnostic (LA) and language-specific (LS) tasks to explore the performance and language-dominance effects of multilingual confidence estimation across different tasks.
The benchmark comprises four meticulously checked and human-evaluated high-quality multilingual datasets for LA tasks, plus one LS dataset tailored to the social, cultural, and geographical contexts specific to a language. The proposed MlingConf datasets are constructed as follows:
```bash
# Build each LA dataset from its English source, then translate it into the target languages
python code/preparation.py --dataset triviaqa
python code/translate.py --stage translate --dataset triviaqa

python code/preparation.py --dataset common
python code/translate.py --stage translate --dataset common
```
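As a rough illustration of what the translate stage does, the sketch below renders each English question/answer pair into several target languages. The JSONL schema, the language list, and the `translate_text` helper are all assumptions made for illustration; the repository's actual translation pipeline may differ.

```python
import json

# Illustrative target languages; the actual MlingConf language set may differ.
TARGET_LANGS = ["zh", "ja", "fr", "th", "ar"]

def translate_text(text: str, lang: str) -> str:
    # Placeholder: swap in a real MT backend (e.g., an API call) here.
    # Returning the input unchanged keeps the sketch runnable end to end.
    return text

def translate_dataset(in_path: str, out_path: str) -> None:
    """Translate every question/answer pair into each target language,
    keeping the English original alongside the translations."""
    with open(in_path) as f:
        samples = [json.loads(line) for line in f]
    with open(out_path, "w") as f:
        for sample in samples:
            entry = {"en": sample}
            for lang in TARGET_LANGS:
                entry[lang] = {
                    "question": translate_text(sample["question"], lang),
                    "answer": translate_text(sample["answer"], lang),
                }
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```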
With the datasets in place, run LLM inference to generate answers:

```bash
# model_name options: llama3, gpt-3.5, llama2, vicuna
# dataset options:    triviaqa, common, gsm8k, sciq, lsqa
# max_length per dataset: triviaqa=16, common=16, gsm8k=200, sciq=16, lsqa=16
CUDA_VISIBLE_DEVICES=1 python code/inference.py --model_name llama3 --dataset triviaqa --max_length 16
```
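For reference, here is a minimal sketch of what the inference step might look like with Hugging Face `transformers`; the concrete checkpoint, prompt format, and decoding settings are assumptions, not the repository's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the repo maps short names like "llama3" to concrete models.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def answer(question: str, max_length: int = 16) -> str:
    """Generate a short answer; max_length mirrors the --max_length flag above."""
    prompt = f"Answer the question concisely.\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_length, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
```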
Then elicit confidence scores for the generated answers:

```bash
# model_name options: llama3, gpt-3.5, llama2, vicuna
# dataset options:    triviaqa, common, gsm8k, sciq, lsqa
CUDA_VISIBLE_DEVICES=1 python code/confidence.py --model_name llama3 --dataset triviaqa --max_length 48
```
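The paper compares several confidence elicitation methods; purely as an illustration, the sketch below computes a generic logit-based confidence, the average token probability of the generated answer. It reuses the `model` and `tokenizer` from the previous sketch and is not necessarily the method `code/confidence.py` implements.

```python
import torch
import torch.nn.functional as F

def sequence_confidence(model, tokenizer, prompt: str, max_new_tokens: int = 48) -> float:
    """Average token probability of the generated answer,
    a common logit-based confidence estimate."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )
    gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    # One score tensor per generated step; take the log-prob of the chosen token.
    logprobs = [
        F.log_softmax(step_logits[0], dim=-1)[token_id]
        for step_logits, token_id in zip(out.scores, gen_tokens)
    ]
    return torch.exp(torch.stack(logprobs).mean()).item()
```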
Finally, evaluate answer accuracy and confidence calibration:

```bash
python code/evaluation.py --model llama3 --dataset triviaqa
```
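Evaluation for confidence estimation typically pairs accuracy with a calibration metric. As an illustration, here is a standard expected calibration error (ECE) computation; whether `code/evaluation.py` reports exactly this metric is an assumption.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Standard ECE: the weighted gap between mean confidence and
    accuracy within each confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Bin weight times |accuracy - mean confidence| in that bin.
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Usage example with toy predictions:
# print(expected_calibration_error([0.9, 0.8, 0.3], [1, 0, 0]))
```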