From 48ff980c91f340f403d346476a1f6af89b5635f3 Mon Sep 17 00:00:00 2001
From: mengliu1998 <604629@gmail.com>
Date: Sat, 1 Jun 2024 15:07:38 -0700
Subject: [PATCH] Refine the guideline

---
 docs/source/developer_notes/evaluation.rst | 53 ++++++++++++++++------
 docs/source/tutorials/eval_a_rag.rst       |  9 ++++
 2 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/docs/source/developer_notes/evaluation.rst b/docs/source/developer_notes/evaluation.rst
index dcf15370..ae10f729 100644
--- a/docs/source/developer_notes/evaluation.rst
+++ b/docs/source/developer_notes/evaluation.rst
@@ -34,22 +34,28 @@ To comprehensively assess the capabilities of LLMs, researchers typically utiliz
 * `Chatbot Arena `_, which is an open platform to evaluate LLMs through human voting.
 * `API-Bank `_, which evaluates LLMs' ability to use external tools and APIs to perform tasks.

-Please refer to the review papers [1]_, [2]_, [6]_ for a more comprehensive overview of the datasets and benchmarks used in LLM evaluations. Additionally, a lot of datasets are readily accessible via the `Hugging Face Datasets `_ library. For instance, the MMLU dataset can be easily loaded from the Hub using the following code snippet.
+Please refer to the review papers (*Chang et al.* [1]_, *Guo et al.* [2]_, and *Liu et al.* [6]_) for a more comprehensive overview of the datasets and benchmarks used in LLM evaluations. Additionally, many datasets are readily accessible via the `Hugging Face Datasets `_ library. For instance, the MMLU dataset can be easily loaded from the Hub using the following code snippet.

 .. code-block:: python
+    :linenos:

     from datasets import load_dataset

-    dataset = load_dataset(path="mmlu", name='abstract_algebra')
-    print(dataset)
+    dataset = load_dataset(path="cais/mmlu", name="abstract_algebra")
+    print(dataset["test"])
+    # Dataset({
+    #     features: ['question', 'subject', 'choices', 'answer'],
+    #     num_rows: 100
+    # })

 How to evaluate?
 ------------------------------------------
-The final question is how to evaluate. Evaluation methods can be divided into *automated evaluation* and *human evaluation* [1]_, [6]_. Automated evaluation typically involves using metrics such as accuracy and BERTScore or employing an LLM as the judge, to quantitatively assess the performance of LLMs on specific tasks. Human evaluation, on the other hand, involves human in the loop to evaluate the quality of the generated text or the performance of the LLM on specific tasks. Here, we recommend a few evaluation methods for LLMs.
+The final question is how to evaluate. Evaluation methods can be divided into *automated evaluation* and *human evaluation* (*Chang et al.* [1]_ and *Liu et al.* [6]_). Automated evaluation typically involves using metrics such as accuracy and BERTScore, or employing an LLM as the judge, to quantitatively assess the performance of LLMs on specific tasks. Human evaluation, on the other hand, involves humans in the loop to evaluate the quality of the generated text or the performance of the LLM. Here, we recommend a few automated evaluation methods that can be used to evaluate LLMs and their applications.

-If you are interested in computing metrics such as accuracy, F1-score, ROUGE, BERTScore, perplexity, etc for LLMs and LLM applications, you can check out the metrics provided by `Hugging Face Metrics `_ or `TorchMetrics `_.
-For instance, you can use the following code snippet to compute the BERTScore, which uses the pre-trained contextual embeddings from BERT and matched words in generated text and reference text by cosine similarity.
+If you are interested in computing metrics such as accuracy, F1-score, ROUGE, BERTScore, and perplexity for LLMs and LLM applications, you can check out the metrics provided by `Hugging Face Metrics `_ or `TorchMetrics `_. For instance, to compute the BERTScore, you can use the corresponding metric function provided by Hugging Face, which uses pre-trained contextual embeddings from BERT and matches words in the generated and reference text by cosine similarity.

 .. code-block:: python
+    :linenos:

     from datasets import load_metric
     bertscore = load_metric("bertscore")
@@ -57,20 +63,37 @@ If you are interested in computing metrics such as accuracy, F1-score, ROUGE, BE
     reference_text = ["life is great", "make it to the moon"]
     results = bertscore.compute(predictions=generated_text, references=reference_text, model_type="distilbert-base-uncased")
     print(results)
-    {'precision': [0.9419728517532349, 0.7959791421890259], 'recall': [0.9419728517532349, 0.7749403119087219], 'f1': [0.9419728517532349, 0.7853187918663025], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.38.2)'}
+    # {'precision': [0.9419728517532349, 0.7959791421890259], 'recall': [0.9419728517532349, 0.7749403119087219], 'f1': [0.9419728517532349, 0.7853187918663025], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.38.2)'}

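+Other metrics from the list above can be computed in the same way. The following sketch computes ROUGE scores with the same Hugging Face metric loader; it assumes the ``rouge_score`` package is installed, and the exact format of the returned scores may vary across library versions.
+
+.. code-block:: python
+    :linenos:
+
+    from datasets import load_metric
+
+    rouge = load_metric("rouge")  # requires the `rouge_score` package
+    generated_text = ["life is good", "aim for the moon"]
+    reference_text = ["life is great", "make it to the moon"]
+    results = rouge.compute(predictions=generated_text, references=reference_text)
+    print(sorted(results.keys()))
+    # e.g. ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
+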
 If you are particularly interested in evaluating RAG (Retrieval-Augmented Generation) pipelines, we have several metrics available in LightRAG to assess both the quality of the retrieved context and the quality of the final generated answer.

-- :class:`RetrieverEvaluator `: This evaluator is used to evaluate the performance of the retriever component of the RAG pipeline. It has the following metric functions:
-    - :obj:`compute_recall`: This function computes the recall of the retriever. It is defined as the number of relevant documents retrieved by the retriever divided by the total number of relevant documents in the knowledge base.
-    - :obj:`compute_context_relevance`: This function computes the relevance of the retrieved context. It is defined as the ratio of the number of relevant context tokens in the retrieved context to the total number of tokens in the retrieved context.
-- :class:`AnswerMacthEvaluator `: This evaluator is used to evaluate the performance of the generator component of the RAG pipeline. It has the following metric functions:
-    - :obj:`compute_match_acc (if type is 'exact_match')`: This function computes the exact match accuracy of the generated answer. It is defined as the number of generated answers that exactly match the ground truth answer divided by the total number of generated answers.
-    - :obj:`compute_match_acc (if type is 'fuzzy_match')`: This function computes the fuzzy match accuracy of the generated answer. It is defined as the number of generated answers that contain the ground truth answer divided by the total number of generated answers.
-- :class:`LLMasJudge `: This evaluator uses an LLM to get the judgement of the predicted answer for a list of questions. The task description and the judgement query of the LLM judge can be customized.
-    - :obj:`compute_judgement`: This function computes the judgement of the predicted answer. It is defined as the number of generated answers that are judged as correct by the LLM divided by the total number of generated answers.
+- :class:`RetrieverEvaluator `: This evaluator is used to evaluate the performance of the retriever component of the RAG pipeline. It has metric functions to compute the recall and context relevance of the retriever.
+- :class:`AnswerMacthEvaluator `: This evaluator is used to evaluate the performance of the generator component of the RAG pipeline. It has metric functions to compute the exact match and fuzzy match accuracy of the generated answer; a short example is sketched below.
+- :class:`LLMasJudge `: This evaluator uses an LLM to get the judgement of the predicted answer for a list of questions. The task description and the judgement query of the LLM judge can be customized. It has a metric function to compute the judgement score, which is the number of generated answers that are judged as correct by the LLM divided by the total number of generated answers.

-Please refer to the tutorial on `Evaluating a RAG Pipeline <>`_ for more details on how to use these evaluators. For more metrics for evaluating RAG pipelines, you can check out the `RAGAS `_ library, which also has a set of metrics for evaluating RAG pipelines.
+For example, you can use the following code snippet to compute the recall and relevance of the retriever component of the RAG pipeline for a single query.
+
+.. code-block:: python
+    :linenos:
+
+    from eval.evaluators import RetrieverEvaluator
+
+    retrieved_context = "Apple is founded before Google."  # Retrieved context
+    gt_context = ["Apple is founded in 1976.",
+                  "Google is founded in 1998.",
+                  "Apple is founded before Google."]  # Ground truth context
+    retriever_evaluator = RetrieverEvaluator()  # Initialize the RetrieverEvaluator
+    recall = retriever_evaluator.compute_recall_single_query(
+        retrieved_context, gt_context
+    )  # Compute the recall of the retriever
+    relevance = retriever_evaluator.compute_context_relevance_single_query(
+        retrieved_context, gt_context
+    )  # Compute the relevance of the retriever
+    print(f"Recall: {recall}, Relevance: {relevance}")
+    # Recall: 0.3333333333333333, Relevance: 1.0
+
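+Along the same lines, the generated answers can be checked with the ``AnswerMacthEvaluator``. The snippet below is a minimal sketch: the import path and the argument order of :obj:`compute_match_acc` (predicted answers first, then ground truth answers) are assumptions, so double-check them against the tutorial linked below.
+
+.. code-block:: python
+    :linenos:
+
+    from eval.evaluators import AnswerMacthEvaluator  # assumed import path
+
+    pred_answers = ["Apple was founded in 1976.", "Google was founded in 1998."]
+    gt_answers = ["1976", "1997"]  # Ground truth answers
+    generator_evaluator = AnswerMacthEvaluator(type="fuzzy_match")  # or "exact_match"
+    # Assumed argument order: predicted answers first, then ground truth answers
+    avg_acc, acc_list = generator_evaluator.compute_match_acc(pred_answers, gt_answers)
+    print(f"Average accuracy: {avg_acc}, per-query accuracy: {acc_list}")
+    # With fuzzy matching, the first prediction contains "1976" and counts as a match; the second does not.
+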
+For more detailed instructions on how to use these evaluators, you can refer to the tutorial :doc:`Evaluating a RAG Pipeline <../tutorials/eval_a_rag>`, which provides a step-by-step guide to evaluating a RAG pipeline on the HotpotQA dataset.
+
+If you intend to use metrics that are not available in LightRAG, you can also implement your own custom metric functions or use other libraries such as `RAGAS `_ to compute the desired metrics for evaluating RAG pipelines.
+
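+For example, a custom metric can be as simple as a plain Python function. The following sketch (not part of LightRAG) computes the fraction of predicted answers that contain the corresponding ground truth answer.
+
+.. code-block:: python
+    :linenos:
+
+    def containment_accuracy(pred_answers: list, gt_answers: list) -> float:
+        """Fraction of predicted answers that contain their ground truth answer."""
+        matches = [
+            1.0 if gt.lower() in pred.lower() else 0.0
+            for pred, gt in zip(pred_answers, gt_answers)
+        ]
+        return sum(matches) / len(matches)
+
+    print(containment_accuracy(["Apple was founded in 1976."], ["1976"]))
+    # 1.0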

 .. [1] Chang, Yupeng, et al. "A survey on evaluation of large language models." ACM Transactions on Intelligent Systems and Technology 15.3 (2024): 1-45.

diff --git a/docs/source/tutorials/eval_a_rag.rst b/docs/source/tutorials/eval_a_rag.rst
index 7f211583..798d364c 100644
--- a/docs/source/tutorials/eval_a_rag.rst
+++ b/docs/source/tutorials/eval_a_rag.rst
@@ -25,6 +25,7 @@ Let's walk through the code to evaluate a RAG pipeline step by step.
 We import the necessary dependencies for our evaluation script. These include modules for loading datasets, constructing a RAG pipeline, and evaluating the performance of the RAG pipeline.

 .. code-block::
+    :linenos:

     import yaml

@@ -48,6 +49,7 @@ We import the necessary dependencies for our evaluation script. These include mo
 We load the configuration settings from `a YAML file `_.
 This file contains various parameters for the RAG pipeline. You can customize these settings based on your requirements.

 .. code-block::
+    :linenos:

     with open("./configs/rag_hotpotqa.yaml", "r") as file:
         settings = yaml.safe_load(file)
@@ -56,6 +58,7 @@ We load the configuration settings from `a YAML file `_
 as an example. Each data sample in HotpotQA has *question*, *answer*, *context* and *supporting_facts* selected from the whole context. We load the HotpotQA dataset using the :obj:`load_dataset` function from the `datasets `_ module. We select a subset of the dataset as an example for evaluation purposes.

 .. code-block::
+    :linenos:

     dataset = load_dataset(path="hotpot_qa", name="fullwiki")
     dataset = dataset["train"].select(range(5))
@@ -64,6 +67,7 @@ In this tutorial, we use the `HotpotQA dataset `_.

 .. code-block::
+    :linenos:

     all_questions = []
     all_retrieved_context = []
@@ -120,6 +126,7 @@ To get the ground truth context string from the *supporting_facts* filed in Hotp
 We first evaluate the performance of the retriever component of the RAG pipeline. We compute the average recall and context relevance for each query using the :class:`RetrieverEvaluator ` class.

 .. code-block::
+    :linenos:

     retriever_evaluator = RetrieverEvaluator()
     avg_recall, recall_list = retriever_evaluator.compute_recall(
@@ -134,6 +141,7 @@ We first evaluate the performance of the retriever component of the RAG pipeline
 Next, we evaluate the performance of the generator component of the RAG pipeline. We compute the average exact match accuracy for each query using the :class:`AnswerMacthEvaluator ` class.

 .. code-block::
+    :linenos:

     generator_evaluator = AnswerMacthEvaluator(type="fuzzy_match")
     answer_match_acc, match_acc_list = generator_evaluator.compute_match_acc(
@@ -146,6 +154,7 @@ Finally, we evaluate the performance of the generator component of the RAG pipel
 Note that :obj:`task_desc_str` and :obj:`judgement_query` can be customized.

 .. code-block::
+    :linenos:

     llm_evaluator = Generator(
         model_client=OpenAIClient(),