Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
📣 This paper has been accepted to ICLR 2025!
📣 We are releasing Trust-Score, a holistic evaluation of the trustworthiness of LLMs in a RAG framework, and the Trust-Align framework that aligns LLMs for higher Trust-Score. Paper
We are excited to announce the release of Trust-Score evaluation datasets and Trust-Align alignment datasets:
- Trust-Score: It features calibrated questions and refusals to measure the model's trustworthiness.
- Trust-Align: It enhances the model's trustworthiness with high-quality synthesized cited responses.
LLMs are an integral part of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the quality of end-to-end RAG systems, there is a lack of research on understanding the appropriateness of an LLM for the RAG task. Thus, we introduce a new metric, Trust-Score, that provides a holistic evaluation of the trustworthiness of LLMs in a RAG framework. We show that various prompting methods, such as in-context learning, fail to adapt LLMs effectively to the RAG task. We therefore propose Trust-Align, a framework to align LLMs for higher Trust-Score. LLaMA-3-8b, aligned with our method, significantly outperforms open-source LLMs of comparable sizes on ASQA (↑10.7), QAMPARI (↑29.2), and ELI5 (↑14.9).
conda env create -f environment.yml
conda activate cite
pip install -r requirements.txt
We use the latest version of alignment-handbook for training (version alignment-handbook-0.4.0.dev0). We followed the installation instructions in the alignment-handbook repository:
git clone https://github.com/huggingface/alignment-handbook.git
cd ./alignment-handbook/
python -m pip install .
The Trust-Score evaluation dataset is available on Hugging Face.
The Trust-Align training dataset is also available on Hugging Face.
Trust-Score is a more reliable and comprehensive measure of an LLM's capabilities for RAG, covering the following aspects: Does the LLM correctly identify answerable questions? Are the responses grounded in the provided documents, i.e., do the citations support the claims in the ground-truth response? And are the citations relevant?
We support three types of dataset format: EM (Exact Match, like ASQA type), EM@5 (top-5 EM, like QAMPARI type), or CM (Claim Match, like ELI5 type).
Your evaluation dataset should satisfy the following format:
The file contains a list of JSON dictionaries with the following fields:
- question - The question being asked.
  Example:
  "question": "Who has the highest goals in world football?"
- answers - A list of all gold answers, where each element is an array containing different variations of each gold answer. The gold answers can either be in short form or full sentences.
  Examples:
  "answers": [ ["Daei", "Ali Daei"], ["Bican", "Josef Bican"], ["Sinclair", "Christine Sinclair"] ]
  or
  "answers": [ [ "Firms like Snapchat and Uber need to establish their brand and amass users before introducing ads." ], [ "Introducing ads too early can deter potential users." ], [ "Uber is reinvesting a lot of money to make their service better." ] ]
- docs - A list of dictionaries providing evidence from documents related to the question. Each dictionary contains the following fields:
  - title - The title of the document.
    Example:
    "title": "Argentina–Brazil football rivalry"
  - text - A snippet from the document containing relevant information.
    Example:
    "text": "Pelé's 1281 goals are recognized by FIFA as the highest total achieved by a professional footballer, although the Soccer Statistic Foundation (rssf) recognizes only 767 goals in official mode, occupying the third place after Josef Bican (805) and Romario (772)."
  - answers_found - A list of integers, where each element corresponds to whether the answer was found in the document (0 if not found, 1 if found).
    Example:
    "answers_found": [ 0, 0, 0 ]
  - rec_score - A recall score indicating the percentage of answers entailed by the document.
    Example:
    "rec_score": 0.0
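As a rough illustration of the format described above, the following sketch checks that a record carries the listed fields and derives `answers_found` and a recall value for a single document. The helper names (`validate_record`, `answers_found_in`, `recall`), the case-insensitive substring rule, the expectation that `answers_found` has one entry per gold answer, and the choice to express recall as a fraction between 0 and 1 are all assumptions made for illustration. The repository's own matching logic may differ, especially for sentence-style gold answers.

```python
from typing import Any

# Field names taken from the format description above.
REQUIRED_FIELDS = ("question", "answers", "docs")
REQUIRED_DOC_FIELDS = ("title", "text", "answers_found", "rec_score")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
    n_answers = len(record.get("answers", []))
    for i, doc in enumerate(record.get("docs", [])):
        for field in REQUIRED_DOC_FIELDS:
            if field not in doc:
                problems.append(f"docs[{i}] missing field: {field}")
        found = doc.get("answers_found")
        # Assumption: one 0/1 flag per gold answer.
        if found is not None and len(found) != n_answers:
            problems.append(
                f"docs[{i}] answers_found has {len(found)} entries, expected {n_answers}"
            )
    return problems


def answers_found_in(text: str, answers: list[list[str]]) -> list[int]:
    """Flag each gold answer 1 if any of its variants occurs in the text, else 0.

    Assumption: plain case-insensitive substring matching.
    """
    lowered = text.lower()
    return [int(any(v.lower() in lowered for v in variants)) for variants in answers]


def recall(found: list[int]) -> float:
    """Fraction of gold answers flagged as found (0.0 for an empty list)."""
    return sum(found) / len(found) if found else 0.0


# Illustrative usage with a made-up document text.
answers = [["Daei", "Ali Daei"], ["Bican", "Josef Bican"], ["Sinclair", "Christine Sinclair"]]
found = answers_found_in("Ali Daei and Christine Sinclair are both mentioned here.", answers)
print(found)  # [1, 0, 1]
```

Running `validate_record` over every entry before evaluation is a cheap way to catch missing fields or misaligned `answers_found` lists early.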
You can easily evaluate your model based on the formatted evaluation dataset by running the following code:
from utils import RUN_Config
from trust_eval import TRUST_SCORE

config = RUN_Config()

# Assume you load this from a JSON or YAML file
example_config = {
    "prompt_file": "prompts/asqa_rejection.json",
    "eval_file": "data/asqa_eval_top100_calibrated.json",
    "output_dir": "save",
    "eval_type": "em",
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "max_length": 4096,
    "temperature": 0.5,
    "top_p": 0.95,
    "vllm": True,
    "no_demo": True,
}

# Update config with new values
config.update_from_dict(example_config)

score = TRUST_SCORE(config)
print(score)
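The comment in the snippet above mentions loading the settings from a JSON or YAML file. A minimal sketch of the JSON case, using only the standard library, might look like the following. The helper name `load_config_dict` and the path `configs/asqa_eval.json` are hypothetical and not part of the repository; the returned dictionary is meant to be passed to `config.update_from_dict` as in the snippet above.

```python
import json
from pathlib import Path


def load_config_dict(path: str) -> dict:
    """Read a JSON file and return its top-level object as a dict."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    # Guard against files whose top level is a list or scalar.
    if not isinstance(data, dict):
        raise ValueError(f"{path} must contain a JSON object at the top level")
    return data


# Hypothetical usage (the path is an assumption):
# config.update_from_dict(load_config_dict("configs/asqa_eval.json"))
```

Note that JSON uses lowercase `true` and `false`; `json.load` converts them to Python `True` and `False`, so flags such as `vllm` and `no_demo` carry over unchanged.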
Please first refer to Retrieval in the ALCE benchmark to download the required document corpora (GTR-based Wikipedia snapshot and BM25-based Sphere).
Download the ASQA, QAMPARI, ELI5, and ExpertQA datasets accordingly.
You can reproduce the seed sample curation step with the following command:
cd TRUST_ALIGN/seed_samples
sh cluster.sh
sh re_cali.sh
In re_cali.sh, remember to set BM25_SPHERE_PATH, DPR_WIKI_TSV, and GTR_EMB to the paths where you stored each corpus, respectively.
The output is {dataset}_doc.json in the data folder.
The choice of dataset can be asqa, qampari, eli5, or expertqa.
You can reproduce the augment sample curation step (document recombination) with the following command:
cd TRUST_ALIGN/augment_samples
sh doc_recombination.sh {dataset}
The output is {dataset}_doc_augment.json in the data/ folder.
You can create natural responses by running the following code:
cd TRUST_ALIGN/positives_synthesis
sh gen_ans.sh
In gen_ans.sh, please set --data_file to the path to your dataset.
To get positive responses with citations, run the following code:
python gen_positives --input_folder {dataset_folder}
{dataset_folder} is the path to your saved datasets folder.
You first need to obtain the model's output for curated samples as follows:
cd TRUST_ALIGN/negatives_selection
sh infer.sh
In infer.sh, you need to set INFER_FILE and OUTPUT_DIR to the path where you saved the samples and the path where you want to save the output, respectively. You can also change the --config inside for other datasets.
Based on the obtained model output, you can run the error selection step below. The result is stored in .json format in the data/ folder.
sh error_selection.sh
In error_selection.sh, you also need to set BASE_DIR and OUTPUT_DIR to the path where you saved the samples and the path where you want to save the output, respectively.
Our training code is based on the alignment-handbook repository. We provide the complete training code and configuration files for both SFT and DPO. To get started, you'll need to customize the model_name_or_path and output_dir settings, and adjust the num_processes and per_device_train_batch_size parameters in the .yaml configuration files according to your computational environment. To specify the training dataset, use dataset_mixer to point to your dataset, ensuring it is in the Hugging Face dataset format.
- SFT Training:
cd training
sh sft.sh
- DPO Training:
cd training
sh dpo.sh
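When adjusting `num_processes` and `per_device_train_batch_size` for different hardware, it can help to keep the effective (global) batch size constant. The sketch below assumes standard data-parallel training, where the effective batch size is the product of the number of processes, the per-device batch size, and the gradient accumulation steps. `gradient_accumulation_steps` is a standard Hugging Face training argument that is not mentioned above, and the helper names and numbers are illustrative only, not the settings used in the paper.

```python
def effective_batch_size(
    num_processes: int,
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int = 1,
) -> int:
    """Number of samples contributing to each optimizer step under data parallelism."""
    return num_processes * per_device_train_batch_size * gradient_accumulation_steps


def accumulation_steps_for(
    target: int, num_processes: int, per_device_train_batch_size: int
) -> int:
    """Gradient accumulation steps needed to reach a target effective batch size."""
    per_step = num_processes * per_device_train_batch_size
    if target % per_step != 0:
        raise ValueError(
            f"target {target} is not a multiple of {per_step} samples per forward/backward step"
        )
    return target // per_step


# Illustrative numbers: keep an effective batch size of 128 on two setups.
print(accumulation_steps_for(128, num_processes=8, per_device_train_batch_size=4))  # 4
print(accumulation_steps_for(128, num_processes=2, per_device_train_batch_size=4))  # 16
```

Moving from 8 processes to 2 at the same per-device batch size therefore needs four times as many accumulation steps to match the same effective batch size.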
If you have any questions related to the code or the paper, feel free to email Maojia ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and faster!
Please cite our paper if you use Trust-Align in your work:
@misc{song2024measuringenhancingtrustworthinessllms,
title={Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse},
author={Maojia Song and Shang Hong Sim and Rishabh Bhardwaj and Hai Leong Chieu and Navonil Majumder and Soujanya Poria},
year={2024},
eprint={2409.11242},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.11242},
}