
[WIP] [Eval guide and tutorial] #20

Merged: 9 commits into main from meng, Jun 2, 2024
Conversation

mengliu1998 (Contributor) commented on May 19, 2024:

  • A guideline for LLM evaluation
  • A tutorial for evaluating a RAG pipeline
  • Refined docstrings for the evaluators

Evaluation and Metrics
======================

Evaluating an LLM application essentially involves applying various metric functions. You can write your own metric functions or import them from other libraries. In LightRAG, we provide a set of metrics in :ref:`our evaluators <evaluators>`. In this tutorial, we will show how to use them to evaluate the performance of the retriever and generator components of a RAG pipeline.
Member:

Can you add some research papers/resources from which you designed the metrics, and explain why they are used in RAG?

Contributor (Author):

Added in the guideline

print(f"Average recall: {avg_recall}")
print(f"Average relevance: {avg_relevance}")

Next, we evaluate the performance of the generator component of the RAG pipeline. We compute the exact match accuracy for each query and average it over the eval set using the :class:`AnswerMacthEvaluator <eval.evaluators.AnswerMacthEvaluator>` class.
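
For illustration, a minimal single-query sketch of this step (the constructor argument and the example strings are assumptions, not taken from the diff; compute_match_acc_single_query matches the signature shown further down in this PR):

from eval.evaluators import AnswerMacthEvaluator

answer_evaluator = AnswerMacthEvaluator(type="exact_match")  # constructor argument is assumed
pred_answer = "Apple was founded in 1976."  # generated answer (illustrative)
gt_answer = "Apple was founded in 1976."    # ground-truth answer (illustrative)
acc = answer_evaluator.compute_match_acc_single_query(pred_answer, gt_answer)
print(f"Exact match accuracy: {acc}")  # 1.0 for identical strings
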
Member:

Where are the numbers for the final metrics? It would be good to see a table with different models used.

Contributor (Author):

Good point. Will add them when the core code change is done (The RAG use case code needs to be changed accordingly).

@mengliu1998 mengliu1998 force-pushed the meng branch 2 times, most recently from 823b560 to 47d2b9b Compare May 29, 2024 04:36
@mengliu1998 mengliu1998 changed the title [WIP] [documentation for eval and ICL + ICL component refinement] [WIP] [Eval guide and tutorial] May 29, 2024
liyin2015 (Member) left a comment:

Will merge now, you can optimize it later with feedback

docs/source/developer_notes/evaluation.rst (resolved)
If you are particularly interested in evaluating RAG (Retrieval-Augmented Generation) pipelines, we have several metrics available in LightRAG to assess both the quality of the retrieved context and the quality of the final generated answer.

- :class:`RetrieverEvaluator <eval.evaluators.RetrieverEvaluator>`: evaluates the retriever component of the RAG pipeline, with metric functions to compute the recall and context relevance of the retrieved context.
- :class:`AnswerMacthEvaluator <eval.evaluators.AnswerMacthEvaluator>`: evaluates the generator component of the RAG pipeline, with metric functions to compute the exact match and fuzzy match accuracy of the generated answer.
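
For illustration, a minimal single-query sketch of the RetrieverEvaluator described above (the argument order of compute_recall_single_query is an assumption; the method name and the ground-truth strings mirror the diff excerpt below):

from eval.evaluators import RetrieverEvaluator

retrieved_context = "Apple was founded in 1976. Google was founded in 1998."  # retriever output (illustrative)
gt_context = ["Google is founded in 1998.",
              "Apple is founded before Google."]  # ground-truth context statements
retriever_evaluator = RetrieverEvaluator()  # Initialize the RetrieverEvaluator
recall = retriever_evaluator.compute_recall_single_query(retrieved_context, gt_context)
print(f"Recall: {recall}")
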
Member:

We need to provide better default metrics for the generator than exact match and fuzzy match.

"Google is founded in 1998.",
"Apple is founded before Google."] # Ground truth context
retriever_evaluator = RetrieverEvaluator() # Initialize the RetrieverEvaluator
recall = retriever_evaluator.compute_recall_single_query(
Member:

Any example with a list of retrieved contexts? In a real case it will always be top_k, and the evaluation will run on an eval set of size n, so it would be list[list] instead of a single retrieved_context.
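
For illustration, a batch loop built only on the single-query method visible in this diff (whether RetrieverEvaluator also exposes a batch method is not shown here, so everything below is a sketch):

from eval.evaluators import RetrieverEvaluator

retriever_evaluator = RetrieverEvaluator()
retrieved_contexts = [  # one retrieved string per query in the eval set (top_k chunks already joined)
    "Apple was founded in 1976. Google was founded in 1998.",
    "Paris is the capital of France.",
]
gt_contexts = [  # list[list]: one list of ground-truth statements per query
    ["Google is founded in 1998.", "Apple is founded before Google."],
    ["Paris is the capital of France."],
]
recalls = [
    retriever_evaluator.compute_recall_single_query(ret, gt)
    for ret, gt in zip(retrieved_contexts, gt_contexts)
]
avg_recall = sum(recalls) / len(recalls)
print(f"Average recall: {avg_recall}")
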

print(results)
# {'precision': [0.9419728517532349, 0.7959791421890259], 'recall': [0.9419728517532349, 0.7749403119087219], 'f1': [0.9419728517532349, 0.7853187918663025], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.38.2)'}
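
For context, results of this shape (per-example precision/recall/f1 lists plus a hashcode) match what the Hugging Face evaluate library's bertscore metric returns, so the lines above this print could plausibly look like the sketch below (the predictions/references here are illustrative and will not reproduce the exact numbers shown):

import evaluate  # pip install evaluate bert_score

bertscore = evaluate.load("bertscore")
predictions = ["Google was founded in 1998.", "Apple was founded in 1976 in California."]
references = ["Google was founded in 1998.", "Apple was founded in 2007 in California."]
results = bertscore.compute(predictions=predictions, references=references,
                            model_type="distilbert-base-uncased")
print(results)
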

If you are particularly interested in evaluating RAG (Retrieval-Augmented Generation) pipelines, we have several metrics available in LightRAG to assess both the quality of the retrieved context and the quality of the final generated answer.
Member:

We can let users know that we only have a subset of metrics for RAG for now.

And we can use a table to list all the proper metrics.

A Guideline on LLM Evaluation
=============================

Evaluating LLMs and their applications is crucial for understanding their capabilities and limitations. Overall, such evaluation is a complex and multifaceted process. Below, we provide a guideline for evaluating LLMs and their applications, incorporating aspects outlined by *Chang et al.* [1]_:
Member:

Here are the points we want to achieve:
(1) Be very straightforward and provide info in a concise way.
(2) "Evaluating LLMs and their applications is crucial in both research and production." You cannot improve what you can't measure, and how you measure it decides the product user experience you want to deliver. Researchers care more about benchmarks; in production you can focus more on your tasks and your data, and pick the metrics that help you the best. You might consider experimenting on a public dataset/benchmark similar to your task for development and comparison.
(3) Some essential points got lost in this guide: 1. for classical ML you use classical metrics; 2. for generative AI you use generative metrics.
(4) This is supposed to talk more about metrics, but somehow I did not see a more holistic listing of metrics.

Contributor (Author):

Thanks for all these comments. Will update accordingly!

Evaluating a RAG Pipeline
=========================

Evaluating an LLM application essentially involves applying various metric functions. You can write your own metric functions or import them from other libraries. In LightRAG, we provide a set of metrics in :ref:`our evaluators <evaluators>`. In this tutorial, we will show how to use them to evaluate the performance of the retriever and generator components of a RAG pipeline.
Member:

"Evaluating an LLM application essentially involves various metric functions. You can write your own metric functions or import from other libraries. " -> this whole sentence is off topic.

Here we can focus just on rag and start with what metrics are helpful. In terms of RAG evaluation, our guideline isnt clear enough.


The full code for this tutorial can be found in `use_cases/rag_hotpotqa.py <https://github.com/SylphAI-Inc/LightRAG/blob/main/use_cases/rag_hotpotqa.py>`_.

RAG (Retrieval-Augmented Generation) pipelines leverage a retriever to fetch relevant context from a knowledge base (e.g., a document database) which is then fed to an LLM generator with the query to produce the answer. This allows the model to generate more contextually relevant answers.
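
As a purely illustrative sketch of that flow (a toy retriever plus a plug-in generator callable; none of this is LightRAG API):

from typing import Callable, List

def retrieve_top_k(query: str, docs: List[str], k: int = 2) -> List[str]:
    # Toy keyword-overlap retriever, only to illustrate the "fetch relevant context" step.
    q_words = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)[:k]

def rag_answer(query: str, docs: List[str], generate: Callable[[str], str]) -> str:
    context = "\n".join(retrieve_top_k(query, docs))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)  # the LLM generator call is supplied by the caller
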
Member:

Can mention that this decreases hallucination.

@@ -51,14 +59,17 @@ def compute_match_acc_single_query(self, pred_answer: str, gt_answer: str) -> fl
def compute_match_acc(
Member:

(1) Would suggest separating these evaluators into three files.
(2) Instead of each one having a different function name, use the same function name so users will have an easier time. Something like compute or run, something standard, or even use call.
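
A small sketch of what suggestion (2) could look like (the class names here are illustrative, not existing LightRAG code):

from abc import ABC, abstractmethod
from typing import Any, List

class BaseEvaluator(ABC):
    """Every evaluator exposes the same entry point, e.g. compute()."""

    @abstractmethod
    def compute(self, predictions: List[Any], ground_truths: List[Any]) -> Any:
        ...

class ExactMatchEvaluator(BaseEvaluator):
    def compute(self, predictions: List[str], ground_truths: List[str]) -> float:
        # Fraction of predictions that exactly match their ground truth.
        matches = [p.strip() == g.strip() for p, g in zip(predictions, ground_truths)]
        return sum(matches) / len(matches)
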

    LLM as judge for evaluating the performance of an LLM.

    Args:
        llm_evaluator (Generator): LLM model to be used as the judge
    """

    def __init__(self, llm_evaluator: Generator):
Member:

We should try to provide a starter llm_evaluator here in the code.
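
A possible starter, sketched independently of the LightRAG Generator API (which is not shown in this diff): a default judge prompt plus a helper that accepts any prompt-to-text callable.

from typing import Callable

DEFAULT_JUDGE_PROMPT = (
    "You are an evaluator. Given a question, a ground-truth answer, and a predicted answer, "
    "reply 'yes' if the prediction answers the question correctly, otherwise reply 'no'.\n"
    "Question: {question}\nGround truth: {gt_answer}\nPrediction: {pred_answer}\nJudgement:"
)

def judge_single_query(llm_call: Callable[[str], str], question: str, gt_answer: str, pred_answer: str) -> bool:
    # llm_call is any callable that maps a prompt string to the model's text output.
    prompt = DEFAULT_JUDGE_PROMPT.format(
        question=question, gt_answer=gt_answer, pred_answer=pred_answer
    )
    return llm_call(prompt).strip().lower().startswith("yes")
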

@mengliu1998 mengliu1998 merged commit 23cdf23 into main Jun 2, 2024
2 checks passed