[WIP] [Eval guide and tutorial] #20
Conversation
mengliu1998 commented May 19, 2024 (edited)
- A guideline for LLM evaluation
- A tutorial for evaluating a RAG pipeline
- Refined docstrings for the evaluators
Evaluation and Metrics
======================

Evaluating an LLM application essentially involves various metric functions. You can write your own metric functions or import them from other libraries. In LightRAG, we provide a set of metrics in :ref:`our evaluators <evaluators>`. In this tutorial, we will show how to use them to evaluate the performance of the retriever and generator components of a RAG pipeline.
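For illustration, here is a minimal sketch of what a hand-rolled metric function can look like (this helper is hypothetical and is not one of LightRAG's evaluators):

```python
from typing import List

def exact_match_accuracy(pred_answers: List[str], gt_answers: List[str]) -> float:
    """Fraction of predictions that exactly match the ground-truth answers."""
    matches = [
        pred.strip().lower() == gt.strip().lower()
        for pred, gt in zip(pred_answers, gt_answers)
    ]
    return sum(matches) / len(matches) if matches else 0.0

print(exact_match_accuracy(["Paris", "1998"], ["Paris", "1997"]))  # 0.5
```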
Can you add some research papers/resources from which you designed the metrics, and explain why they are used in RAG?
Added in the guideline
print(f"Average recall: {avg_recall}") | ||
print(f"Average relevance: {avg_relevance}") | ||
|
||
Next, we evaluate the performance of the generator component of the RAG pipeline. We compute the average exact match accuracy for each query using the :class:`AnswerMacthEvaluator <eval.evaluators.AnswerMacthEvaluator>` class. |
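A hedged sketch of how this step might look in code, based on the method names visible elsewhere in this diff; the constructor argument and the batch-call parameters are assumptions, not confirmed API:

```python
from eval.evaluators import AnswerMacthEvaluator  # class name as spelled in the codebase

# The "exact_match" constructor argument is an assumption for illustration.
answer_evaluator = AnswerMacthEvaluator(type="exact_match")

# Single-query accuracy (signature shown in the diff): 1.0 on an exact match.
acc = answer_evaluator.compute_match_acc_single_query(pred_answer="1998", gt_answer="1998")

# Batch accuracy over the eval set (argument order assumed).
avg_acc = answer_evaluator.compute_match_acc(["1998", "Apple Inc."], ["1998", "Apple"])
print(f"Average exact match accuracy: {avg_acc}")
```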
Where are the numbers for the final metrics? It would be good to see a table with the different models used.
Good point. Will add them when the core code change is done (the RAG use case code needs to be changed accordingly).
823b560 to 47d2b9b
Will merge now; you can optimize it later with feedback.
If you are particularly interested in evaluating RAG (Retrieval-Augmented Generation) pipelines, we have several metrics available in LightRAG to assess both the quality of the retrieved context and the quality of the final generated answer.

- :class:`RetrieverEvaluator <eval.evaluators.RetrieverEvaluator>`: This evaluator is used to evaluate the performance of the retriever component of the RAG pipeline. It has metric functions to compute the recall and context relevance of the retriever (simplified versions of these two metrics are sketched after this list).
- :class:`AnswerMacthEvaluator <eval.evaluators.AnswerMacthEvaluator>`: This evaluator is used to evaluate the performance of the generator component of the RAG pipeline. It has metric functions to compute the exact match and fuzzy match accuracy of the generated answer.
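For reference, simplified and hypothetical definitions of the two retriever metrics; the actual RetrieverEvaluator implementation may differ:

```python
from typing import List

def context_recall(retrieved_context: str, gt_context: List[str]) -> float:
    """Fraction of ground-truth context statements that appear in the retrieved context."""
    hits = [gt in retrieved_context for gt in gt_context]
    return sum(hits) / len(hits) if hits else 0.0

def context_relevance(retrieved_context: str, gt_context: List[str]) -> float:
    """Fraction of retrieved tokens that also occur in the ground-truth context."""
    retrieved_tokens = retrieved_context.split()
    gt_tokens = set(" ".join(gt_context).split())
    if not retrieved_tokens:
        return 0.0
    return sum(tok in gt_tokens for tok in retrieved_tokens) / len(retrieved_tokens)
```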
We need to provide better default metrics for the generator than exact and fuzzy match.
"Google is founded in 1998.", | ||
"Apple is founded before Google."] # Ground truth context | ||
retriever_evaluator = RetrieverEvaluator() # Initialize the RetrieverEvaluator | ||
recall = retriever_evaluator.compute_recall_single_query( |
Any example with a list of retrieved contexts? In a real case it will always be top_k, and it will run on an eval set of size n, so it would be list[list] instead of a single retrieved_context.
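For illustration, a plain-Python sketch of the batch case described above (top_k retrieved chunks per query over an eval set of size n); RetrieverEvaluator's own batch API is not shown in this excerpt, so no specific method name is assumed:

```python
from typing import List

def batch_recall(
    all_retrieved: List[List[str]],   # n queries x top_k retrieved chunks
    all_gt_context: List[List[str]],  # n queries x ground-truth statements
) -> float:
    """Average recall over the eval set."""
    per_query = []
    for retrieved, gt_context in zip(all_retrieved, all_gt_context):
        joined = " ".join(retrieved)
        hits = [gt in joined for gt in gt_context]
        per_query.append(sum(hits) / len(hits) if hits else 0.0)
    return sum(per_query) / len(per_query) if per_query else 0.0
```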
print(results)
# {'precision': [0.9419728517532349, 0.7959791421890259], 'recall': [0.9419728517532349, 0.7749403119087219], 'f1': [0.9419728517532349, 0.7853187918663025], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.38.2)'}
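The exact code producing this output is not shown in this excerpt; one way to get results in this format is Hugging Face's `evaluate` wrapper around BERTScore (requires `pip install evaluate bert_score`; the example strings below are placeholders):

```python
import evaluate

bertscore = evaluate.load("bertscore")
results = bertscore.compute(
    predictions=["Apple was founded in 1976.", "Google was founded in 1998."],
    references=["Apple Inc. was founded on April 1, 1976.", "Google was founded in September 1998."],
    model_type="distilbert-base-uncased",  # matches the hashcode in the output above
)
print(results)
```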

If you are particularly interested in evaluating RAG (Retrieval-Augmented Generation) pipelines, we have several metrics available in LightRAG to assess both the quality of the retrieved context and the quality of the final generated answer.
We can let users know that we only have a subset of metrics for RAG for now, and we can use a table to list all the proper metrics.
A Guideline on LLM Evaluation
=============================

Evaluating LLMs and their applications is crucial for understanding their capabilities and limitations. Overall, such evaluation is a complex and multifaceted process. Below, we provide a guideline for evaluating LLMs and their applications, incorporating aspects outlined by *Chang et al.* [1]_:
Here are the points we want to achieve:
(1) Be very straightforward and provide the information in a concise way.
(2) "Evaluating LLMs and their applications is crucial in both research and production." You cannot improve what you can't measure. How you measure it decides the product user experience you want to deliver. Researchers care more about benchmarks. In production you can focus more on your tasks and your data, and pick the metrics that help you the most. You might consider experimenting on a public dataset/benchmark similar to your task for development and comparison.
(3) Some essential points got lost in this guide: 1. for classical ML you use classical metrics; 2. for generative AI you use generative metrics.
(4) This is supposed to talk more about metrics, but I did not see a more holistic listing of metrics.
Thanks for all these comments. Will update accordingly!
Evaluating a RAG Pipeline
=========================

Evaluating an LLM application essentially involves various metric functions. You can write your own metric functions or import them from other libraries. In LightRAG, we provide a set of metrics in :ref:`our evaluators <evaluators>`. In this tutorial, we will show how to use them to evaluate the performance of the retriever and generator components of a RAG pipeline.
"Evaluating an LLM application essentially involves various metric functions. You can write your own metric functions or import from other libraries. " -> this whole sentence is off topic.
Here we can focus just on rag and start with what metrics are helpful. In terms of RAG evaluation, our guideline isnt clear enough.

The full code for this tutorial can be found in `use_cases/rag_hotpotqa.py <https://github.com/SylphAI-Inc/LightRAG/blob/main/use_cases/rag_hotpotqa.py>`_.

RAG (Retrieval-Augmented Generation) pipelines leverage a retriever to fetch relevant context from a knowledge base (e.g., a document database), which is then fed to an LLM generator together with the query to produce the answer. This allows the model to generate more contextually relevant answers.
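A schematic sketch of that flow; `retriever` and `generator` here are placeholders rather than concrete LightRAG components:

```python
def answer_with_rag(query: str, retriever, generator, top_k: int = 3) -> str:
    # 1. Retrieve the top_k most relevant chunks from the knowledge base.
    retrieved_chunks = retriever(query, top_k=top_k)
    # 2. Ground the generator in the retrieved context, which also helps
    #    reduce hallucination.
    context_str = "\n".join(retrieved_chunks)
    prompt = f"Context:\n{context_str}\n\nQuestion: {query}\nAnswer:"
    return generator(prompt)
```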
Can mention that this decreases hallucination.
@@ -51,14 +59,17 @@ def compute_match_acc_single_query(self, pred_answer: str, gt_answer: str) -> fl
    def compute_match_acc(
(1) Would suggest separating these evaluators into three files.
(2) Instead of each one having a different function name, use the same function name so users will have an easier time. Something like compute, or run, or something standard, or even use call.
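A sketch of that suggestion (hypothetical, not the current LightRAG API): a shared base class so every evaluator is invoked the same way:

```python
from abc import ABC, abstractmethod
from typing import Any, List

class BaseEvaluator(ABC):
    @abstractmethod
    def compute(self, *args: Any, **kwargs: Any) -> Any:
        """Single entry point shared by all evaluators."""

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        return self.compute(*args, **kwargs)

class ExactMatchEvaluator(BaseEvaluator):
    def compute(self, pred_answers: List[str], gt_answers: List[str]) -> float:
        matches = [p.strip().lower() == g.strip().lower() for p, g in zip(pred_answers, gt_answers)]
        return sum(matches) / len(matches) if matches else 0.0
```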
    LLM as judge for evaluating the performance of an LLM.

    Args:
        llm_evaluator (Generator): The LLM to be used as the judge.
    """

    def __init__(self, llm_evaluator: Generator):
We should try to provide a starter llm_evaluator here in the code.
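A hypothetical sketch of what a starter judge could look like; the import paths and the Generator/OpenAIClient constructor arguments are assumptions about the LightRAG API, not confirmed usage:

```python
from typing import Optional

# Assumed import paths; adjust to the actual package layout.
from core.generator import Generator
from components.api_client import OpenAIClient

DEFAULT_JUDGE_TEMPLATE = r"""
You are an evaluator. Given the question, the ground-truth answer, and the
predicted answer, output "True" if the prediction answers the question
correctly, otherwise output "False".
Question: {{question}}
Ground truth: {{gt_answer}}
Prediction: {{pred_answer}}
"""

def default_llm_evaluator() -> Generator:
    # Assumed constructor signature for Generator and OpenAIClient.
    return Generator(
        model_client=OpenAIClient(),
        model_kwargs={"model": "gpt-3.5-turbo"},
        template=DEFAULT_JUDGE_TEMPLATE,
    )

class LLMasJudge:
    def __init__(self, llm_evaluator: Optional[Generator] = None):
        # Fall back to the starter judge when none is supplied.
        self.llm_evaluator = llm_evaluator or default_llm_evaluator()
```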