[WIP] [Eval guide and tutorial] #20
Conversation
mengliu1998 commented May 19, 2024 (edited)
- A guideline for LLM evaluation
- A tutorial for evaluating a RAG pipeline
- Refined docstrings for the evaluators
Evaluation and Metrics
======================

Evaluating an LLM application essentially involves various metric functions. You can write your own metric functions or import them from other libraries. In LightRAG, we provide a set of metrics in :ref:`our evaluators <evaluators>`. In this tutorial, we will show how to use them to evaluate the performance of the retriever and generator components of a RAG pipeline.
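For illustration, here is a minimal sketch of what a hand-rolled metric function can look like (this helper is hypothetical and is not one of LightRAG's evaluators):

```python
from typing import List

def exact_match_accuracy(pred_answers: List[str], gt_answers: List[str]) -> float:
    """Fraction of predictions that exactly match the ground-truth answers."""
    matches = [
        pred.strip().lower() == gt.strip().lower()
        for pred, gt in zip(pred_answers, gt_answers)
    ]
    return sum(matches) / len(matches) if matches else 0.0

print(exact_match_accuracy(["Paris", "1998"], ["Paris", "1997"]))  # 0.5
```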
Can you add some research papers/resources from which you designed the metrics, and explain why they are used in RAG?
Added in the guideline
print(f"Average recall: {avg_recall}") | ||
print(f"Average relevance: {avg_relevance}") | ||
|
||
Next, we evaluate the performance of the generator component of the RAG pipeline. We compute the average exact match accuracy for each query using the :class:`AnswerMacthEvaluator <eval.evaluators.AnswerMacthEvaluator>` class. |
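A hedged sketch of how this step might look in code, based on the method names visible elsewhere in this diff; the constructor argument and the batch-call parameters are assumptions, not confirmed API:

```python
from eval.evaluators import AnswerMacthEvaluator  # class name as spelled in the codebase

# The "exact_match" constructor argument is an assumption for illustration.
answer_evaluator = AnswerMacthEvaluator(type="exact_match")

# Single-query accuracy (signature shown in the diff): 1.0 on an exact match.
acc = answer_evaluator.compute_match_acc_single_query(pred_answer="1998", gt_answer="1998")

# Batch accuracy over the eval set (argument order assumed).
avg_acc = answer_evaluator.compute_match_acc(["1998", "Apple Inc."], ["1998", "Apple"])
print(f"Average exact match accuracy: {avg_acc}")
```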
Where are the numbers for the final metrics? It would be good to see a table with the different models used.
Good point. Will add them when the core code change is done (the RAG use case code needs to be changed accordingly).
823b560 to 47d2b9b
Will merge now; you can optimize it later with feedback.
If you are particularly interested in evaluating RAG (Retrieval-Augmented Generation) pipelines, we have several metrics available in LightRAG to assess both the quality of the retrieved context and the quality of the final generated answer.

- :class:`RetrieverEvaluator <eval.evaluators.RetrieverEvaluator>`: This evaluator is used to evaluate the performance of the retriever component of the RAG pipeline. It has metric functions to compute the recall and context relevance of the retriever (simplified versions of these two metrics are sketched after this list).
- :class:`AnswerMacthEvaluator <eval.evaluators.AnswerMacthEvaluator>`: This evaluator is used to evaluate the performance of the generator component of the RAG pipeline. It has metric functions to compute the exact match and fuzzy match accuracy of the generated answer.
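For reference, simplified and hypothetical definitions of the two retriever metrics; the actual RetrieverEvaluator implementation may differ:

```python
from typing import List

def context_recall(retrieved_context: str, gt_context: List[str]) -> float:
    """Fraction of ground-truth context statements that appear in the retrieved context."""
    hits = [gt in retrieved_context for gt in gt_context]
    return sum(hits) / len(hits) if hits else 0.0

def context_relevance(retrieved_context: str, gt_context: List[str]) -> float:
    """Fraction of retrieved tokens that also occur in the ground-truth context."""
    retrieved_tokens = retrieved_context.split()
    gt_tokens = set(" ".join(gt_context).split())
    if not retrieved_tokens:
        return 0.0
    return sum(tok in gt_tokens for tok in retrieved_tokens) / len(retrieved_tokens)
```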
We need to provide better default metrics for the generator than exact and fuzzy match.
"Google is founded in 1998.", | ||
"Apple is founded before Google."] # Ground truth context | ||
retriever_evaluator = RetrieverEvaluator() # Initialize the RetrieverEvaluator | ||
recall = retriever_evaluator.compute_recall_single_query( |
Any example with a list of retrieved contexts? In a real case it will always be top_k, and it will run on an eval set of size n, so it would be list[list] instead of a single retrieved_context.
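For illustration, a plain-Python sketch of the batch case described above (top_k retrieved chunks per query over an eval set of size n); RetrieverEvaluator's own batch API is not shown in this excerpt, so no specific method name is assumed:

```python
from typing import List

def batch_recall(
    all_retrieved: List[List[str]],   # n queries x top_k retrieved chunks
    all_gt_context: List[List[str]],  # n queries x ground-truth statements
) -> float:
    """Average recall over the eval set."""
    per_query = []
    for retrieved, gt_context in zip(all_retrieved, all_gt_context):
        joined = " ".join(retrieved)
        hits = [gt in joined for gt in gt_context]
        per_query.append(sum(hits) / len(hits) if hits else 0.0)
    return sum(per_query) / len(per_query) if per_query else 0.0
```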
print(results)
# {'precision': [0.9419728517532349, 0.7959791421890259], 'recall': [0.9419728517532349, 0.7749403119087219], 'f1': [0.9419728517532349, 0.7853187918663025], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.38.2)'}
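The exact code producing this output is not shown in this excerpt; one way to get results in this format is Hugging Face's `evaluate` wrapper around BERTScore (requires `pip install evaluate bert_score`; the example strings below are placeholders):

```python
import evaluate

bertscore = evaluate.load("bertscore")
results = bertscore.compute(
    predictions=["Apple was founded in 1976.", "Google was founded in 1998."],
    references=["Apple Inc. was founded on April 1, 1976.", "Google was founded in September 1998."],
    model_type="distilbert-base-uncased",  # matches the hashcode in the output above
)
print(results)
```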

If you are particularly interested in evaluating RAG (Retrieval-Augmented Generation) pipelines, we have several metrics available in LightRAG to assess both the quality of the retrieved context and the quality of the final generated answer.
We can let users know that we only have a subset of metrics for RAG for now, and we can use a table to list all the proper metrics.
A Guideline on LLM Evaluation
=============================

Evaluating LLMs and their applications is crucial for understanding their capabilities and limitations. Overall, such evaluation is a complex and multifaceted process. Below, we provide a guideline for evaluating LLMs and their applications, incorporating aspects outlined by *Chang et al.* [1]_:
Here are the points we want to achieve:
(1) Be very straightforward and provide the information in a concise way.
(2) "Evaluating LLMs and their applications is crucial in both research and production." You cannot improve what you can't measure. How you measure it decides the product user experience you want to deliver. Researchers care more about benchmarks. In production you can focus more on your tasks and your data, and pick the metrics that help you the most. You might consider experimenting on a public dataset/benchmark similar to your task for development and comparison.
(3) Some essential points got lost in this guide: 1. for classical ML you use classical metrics; 2. for generative AI you use generative metrics.
(4) This is supposed to talk more about metrics, but I did not see a more holistic listing of metrics.
Thanks for all these comments. Will update accordingly!
Evaluating a RAG Pipeline
=========================

Evaluating an LLM application essentially involves various metric functions. You can write your own metric functions or import them from other libraries. In LightRAG, we provide a set of metrics in :ref:`our evaluators <evaluators>`. In this tutorial, we will show how to use them to evaluate the performance of the retriever and generator components of a RAG pipeline.
"Evaluating an LLM application essentially involves various metric functions. You can write your own metric functions or import from other libraries. " -> this whole sentence is off topic.
Here we can focus just on rag and start with what metrics are helpful. In terms of RAG evaluation, our guideline isnt clear enough.

The full code for this tutorial can be found in `use_cases/rag_hotpotqa.py <https://github.com/SylphAI-Inc/LightRAG/blob/main/use_cases/rag_hotpotqa.py>`_.

RAG (Retrieval-Augmented Generation) pipelines leverage a retriever to fetch relevant context from a knowledge base (e.g., a document database), which is then fed to an LLM generator together with the query to produce the answer. This allows the model to generate more contextually relevant answers.
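A schematic sketch of that flow; `retriever` and `generator` here are placeholders rather than concrete LightRAG components:

```python
def answer_with_rag(query: str, retriever, generator, top_k: int = 3) -> str:
    # 1. Retrieve the top_k most relevant chunks from the knowledge base.
    retrieved_chunks = retriever(query, top_k=top_k)
    # 2. Ground the generator in the retrieved context, which also helps
    #    reduce hallucination.
    context_str = "\n".join(retrieved_chunks)
    prompt = f"Context:\n{context_str}\n\nQuestion: {query}\nAnswer:"
    return generator(prompt)
```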
Can mention that this decreases hallucination.
@@ -51,14 +59,17 @@ def compute_match_acc_single_query(self, pred_answer: str, gt_answer: str) -> fl
    def compute_match_acc(
(1) Would suggest separating these evaluators into three files.
(2) Instead of each one having a different function name, use the same function name so users will have an easier time. Something like compute, or run, or something standard, or even use call.
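A sketch of that suggestion (hypothetical, not the current LightRAG API): a shared base class so every evaluator is invoked the same way:

```python
from abc import ABC, abstractmethod
from typing import Any, List

class BaseEvaluator(ABC):
    @abstractmethod
    def compute(self, *args: Any, **kwargs: Any) -> Any:
        """Single entry point shared by all evaluators."""

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        return self.compute(*args, **kwargs)

class ExactMatchEvaluator(BaseEvaluator):
    def compute(self, pred_answers: List[str], gt_answers: List[str]) -> float:
        matches = [p.strip().lower() == g.strip().lower() for p, g in zip(pred_answers, gt_answers)]
        return sum(matches) / len(matches) if matches else 0.0
```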
    LLM as judge for evaluating the performance of an LLM.

    Args:
        llm_evaluator (Generator): The LLM to be used as the judge.
    """

    def __init__(self, llm_evaluator: Generator):
We should try to provide a starter llm_evaluator here in the code.
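A hypothetical sketch of what a starter judge could look like; the import paths and the Generator/OpenAIClient constructor arguments are assumptions about the LightRAG API, not confirmed usage:

```python
from typing import Optional

# Assumed import paths; adjust to the actual package layout.
from core.generator import Generator
from components.api_client import OpenAIClient

DEFAULT_JUDGE_TEMPLATE = r"""
You are an evaluator. Given the question, the ground-truth answer, and the
predicted answer, output "True" if the prediction answers the question
correctly, otherwise output "False".
Question: {{question}}
Ground truth: {{gt_answer}}
Prediction: {{pred_answer}}
"""

def default_llm_evaluator() -> Generator:
    # Assumed constructor signature for Generator and OpenAIClient.
    return Generator(
        model_client=OpenAIClient(),
        model_kwargs={"model": "gpt-3.5-turbo"},
        template=DEFAULT_JUDGE_TEMPLATE,
    )

class LLMasJudge:
    def __init__(self, llm_evaluator: Optional[Generator] = None):
        # Fall back to the starter judge when none is supplied.
        self.llm_evaluator = llm_evaluator or default_llm_evaluator()
```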