Documentation for eval
mengliu1998 committed May 19, 2024
1 parent 19d089e commit 4f958ed
Showing 9 changed files with 63 additions and 21 deletions.
12 changes: 12 additions & 0 deletions docs/source/apis/eval/evaluators.rst
@@ -0,0 +1,12 @@
.. _evaluators:

LightRAG.eval
=========================

eval.evaluators
------------------------------------

.. automodule:: eval.evaluators
:members:
:undoc-members:
:show-inheritance:
2 changes: 1 addition & 1 deletion docs/source/get_started/installation.rst
@@ -11,4 +11,4 @@ To start with LightRAG, please follow the steps:

4. (For contributors only) Install pre-commit into your git hooks using ``pre-commit install``, which will automatically check the code standard on every commit.

5. Now you should run any file in the repo.
5. Now you should be able to run any file in the repo.
2 changes: 1 addition & 1 deletion docs/source/get_started/introduction.rst
@@ -15,7 +15,7 @@ What is LightRAG?

LightRAG comes from the best of the AI research and engineering.
Fundamentally, we ask ourselves: what kind of system that combines the
best of research(such as LLM), engineering (such as ‘jinja’) to build
best of research (such as LLM), engineering (such as ‘jinja’) to build
the best applications? We are not a framework. We do not want you to
directly install the package. We want you to carefully decide to take
modules and structures from here to build your own library and
6 changes: 4 additions & 2 deletions docs/source/index.rst
@@ -30,19 +30,21 @@ LightRAG comes from the best of the AI research and engineering. Fundamentally,
:caption: Tutorials

tutorials/simpleQA
tutorials/eval_and_metrics
tutorials/in_context_learning

.. toctree::
:maxdepth: 1
:caption: API Reference

apis/components/components
apis/core/core
apis/eval/evaluators

.. toctree::
:glob:
:maxdepth: 1
:caption: Resources

resources/resources
resources/contributing

6 changes: 3 additions & 3 deletions docs/source/resources/contributing.rst
@@ -44,15 +44,15 @@ To effectively edit the LightRAG documentation, you have several options dependi

Locate the ``.rst`` file you want to edit within the ``docs/source`` directory. You can modify the content directly in any text editor. We are using `reStructuredText <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html>`_ as the language. For formatting help, refer to the reStructuredText Quickstart Guide:

- `Quickstart <https://docutils.sourceforge.io/docs/user/rst/quickstart.html>`
- `reStructuredText Markup Specification <https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html>`
- `Quickstart <https://docutils.sourceforge.io/docs/user/rst/quickstart.html>`_
- `reStructuredText Markup Specification <https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html>`_

**Create a New .rst File**

If you need to add a new section or topic:

- Create a new ``.rst`` file in the appropriate subdirectory within ``docs/source``.
- Write your content following `reStructuredText syntax <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html>`.
- Write your content following `reStructuredText syntax <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html>`_.
- If you are creating a new section, ensure to include your new file in the relevant ``toctree`` located usually in ``index.rst`` or within the closest parent ``.rst`` file, to make it appear in the compiled documentation.

**Convert a Markdown File to .rst Using Pandoc**
25 changes: 25 additions & 0 deletions docs/source/tutorials/eval_and_metrics.rst
@@ -0,0 +1,25 @@
Evaluation and Metrics
======================

Evaluating an LLM application essentially comes down to applying various metric functions. You can write your own metric functions or import them from other libraries. In LightRAG, we provide a set of metrics in :ref:`our evaluators <evaluators>`. In this tutorial, we show how to use them to evaluate the performance of the retriever and generator components of a RAG pipeline.
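As a minimal sketch (assuming the snippet runs from the repository root so that the ``eval`` package is importable), the evaluators can be imported and instantiated as follows:

.. code-block:: python

    from eval.evaluators import (
        RetrieverEvaluator,
        AnswerMacthEvaluator,
        LLMasJudge,
    )

    # One evaluator per RAG component; LLMasJudge additionally needs an LLM
    # generator configured as the judge (see the full example script).
    retriever_evaluator = RetrieverEvaluator()
    answer_evaluator = AnswerMacthEvaluator(type="fuzzy_match")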


Evaluating a RAG Pipeline
---------------------------------------
The full code for this tutorial can be found in `use_cases/rag_hotpotqa.py <https://github.com/SylphAI-Inc/LightRAG/blob/main/use_cases/rag_hotpotqa.py>`_.

RAG (Retrieval-Augmented Generation) pipelines leverage a retriever to fetch relevant context from a knowledge base (e.g., a document database); the retrieved context is then fed, together with the query, to an LLM generator to produce the answer. This allows the model to generate more contextually relevant answers.
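As a rough sketch of that flow (the ``retrieve`` and ``generate`` helpers below are hypothetical placeholders for your own retriever and generator, not LightRAG APIs):

.. code-block:: python

    from typing import List

    # Hypothetical placeholders, for illustration only -- not LightRAG APIs.
    def retrieve(query: str) -> List[str]:
        """Return the context strings judged most relevant to the query."""
        return ["Paris is the capital and most populous city of France."]

    def generate(query: str, context: List[str]) -> str:
        """Ask the LLM to answer the query, grounded in the retrieved context."""
        return "Paris"

    query = "What is the capital of France?"
    retrieved_context = retrieve(query)               # retrieval step
    pred_answer = generate(query, retrieved_context)  # generation step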

Thus, to evaluate a RAG pipeline, we can assess both the quality of the retrieved context and the quality of the final generated answer. Specifically, we can use the following evaluators and their corresponding metrics; a short usage sketch follows the list.

* :class:`RetrieverEvaluator <eval.evaluators.RetrieverEvaluator>`: This evaluator assesses the performance of the retriever component of the RAG pipeline. It has the following metric functions:

  * :obj:`compute_recall`: This function computes the recall of the retriever. It is defined as the number of relevant strings retrieved by the retriever divided by the total number of relevant strings in the knowledge base.
  * :obj:`compute_context_relevance`: This function computes the relevance of the retrieved context. It is defined as the ratio of the number of relevant context tokens in the retrieved context to the total number of tokens in the retrieved context.

* :class:`AnswerMacthEvaluator <eval.evaluators.AnswerMacthEvaluator>`: This evaluator assesses the performance of the generator component of the RAG pipeline. It has the following metric functions:

  * :obj:`compute_match_acc` (if ``type`` is ``"exact_match"``): This function computes the exact-match accuracy of the generated answers. It is defined as the number of generated answers that exactly match the ground truth answer divided by the total number of generated answers.
  * :obj:`compute_match_acc` (if ``type`` is ``"fuzzy_match"``): This function computes the fuzzy-match accuracy of the generated answers. It is defined as the number of generated answers that contain the ground truth answer divided by the total number of generated answers.

* :class:`LLMasJudge <eval.evaluators.LLMasJudge>`: This evaluator uses an LLM to judge the predicted answers for a list of questions. The task description and the judgement query of the LLM judge can be customized.

  * :obj:`compute_judgement`: This function computes the judgement of the predicted answers. It is defined as the number of generated answers that are judged as correct by the LLM divided by the total number of generated answers.

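A minimal usage sketch of these metric functions is given below; the toy inputs are made up for illustration, and ``LLMasJudge`` is omitted because it additionally requires a ``Generator`` configured as the judge (see the full example script):

.. code-block:: python

    from eval.evaluators import RetrieverEvaluator, AnswerMacthEvaluator

    retrieved_context = ["Paris is the capital and most populous city of France."]
    gt_context = [["Paris is the capital of France."]]
    pred_answers = ["Paris"]
    gt_answers = ["Paris"]

    # Retriever metrics: each call returns the average plus the per-query list.
    retriever_evaluator = RetrieverEvaluator()
    avg_recall, recall_list = retriever_evaluator.compute_recall(
        all_retrieved_context=retrieved_context, all_gt_context=gt_context
    )
    avg_relevance, relevance_list = retriever_evaluator.compute_context_relevance(
        all_retrieved_context=retrieved_context, all_gt_context=gt_context
    )

    # Generator metric: exact or fuzzy match against the ground truth answers.
    answer_evaluator = AnswerMacthEvaluator(type="fuzzy_match")
    avg_acc, acc_list = answer_evaluator.compute_match_acc(
        all_pred_answer=pred_answers, all_gt_answer=gt_answers
    )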


TODO
3 changes: 3 additions & 0 deletions docs/source/tutorials/in_context_learning.rst
@@ -0,0 +1,3 @@
In-Context Learning
===================
TODO
24 changes: 12 additions & 12 deletions eval/evaluator.py → eval/evaluators.py
@@ -23,7 +23,7 @@


class AnswerMacthEvaluator:
"""
r"""
Evaluator for evaluating the match between predicted answer and ground truth answer.
Args:
type (str): Type of matching evaluation. Can be "exact_match" or "fuzzy_match". "exact_match" requires the predicted answer to be exactly the same as the ground truth answer. "fuzzy_match" requires the predicted answer to contain the ground truth answer.
@@ -33,7 +33,7 @@ def __init__(self, type: str = "exact_match"):
self.type = type

def compute_match_acc_single_query(self, pred_answer: str, gt_answer: str) -> float:
"""
r"""
Compute the match accuracy of the predicted answer for a single query.
Args:
pred_answer (str): Predicted answer string
@@ -51,7 +51,7 @@ def compute_match_acc_single_query(self, pred_answer: str, gt_answer: str) -> fl
def compute_match_acc(
self, all_pred_answer: List[str], all_gt_answer: List[str]
) -> Tuple[float, List[float]]:
"""
r"""
Compute the match accuracy of the predicted answer for a list of queries.
Args:
all_pred_answer (List[str]): List of predicted answer strings
@@ -69,7 +69,7 @@ def compute_match_acc(


class RetrieverEvaluator:
"""
r"""
Evaluator for evaluating the performance of a retriever.
"""

@@ -79,7 +79,7 @@ def __init__(self):
def compute_recall_single_query(
self, retrieved_context: str, gt_context: Union[str, List[str]]
) -> float:
"""
r"""
Compute the recall of the retrieved context for a single query.
Args:
retrieved_context (str): Retrieved context string
@@ -100,8 +100,8 @@ def compute_recall(
all_retrieved_context: List[str],
all_gt_context: Union[List[str], List[List[str]]],
) -> Tuple[float, List[float]]:
"""
Compute the recall of the retrieved context for a list of queries.
r"""
Compute the recall of the retrieved context for a list of queries. The recall is the ratio of the number of relevant context strings in the retrieved context to the total number of relevant context strings.
Args:
all_retrieved_context (List[str]): List of retrieved context strings
all_gt_context (Union[List[str], List[List[str]]]: List of ground truth context strings and each of them can be a string or a list of strings
@@ -119,7 +119,7 @@ def compute_recall(
def compute_context_relevance_single_query(
self, retrieved_context: str, gt_context: Union[str, List[str]]
) -> float:
"""
r"""
Compute the context relevance of the retrieved context for a single query. The context relevance is the ratio of the number of relevant context tokens in the retrieved context to the total number of tokens in the retrieved context.
Args:
retrieved_context (str): Retrieved context string
@@ -141,7 +141,7 @@ def compute_context_relevance(
all_retrieved_context: List[str],
all_gt_context: Union[List[str], List[List[str]]],
) -> Tuple[float, List[float]]:
"""
r"""
Compute the context relevance of the retrieved context for a list of queries. The context relevance is the ratio of the number of relevant context tokens in the retrieved context to the total number of tokens in the retrieved context.
Args:
all_retrieved_context (List[str]): List of retrieved context strings
@@ -164,7 +164,7 @@ def compute_context_relevance(


class LLMasJudge:
"""
r"""
LLM as judge for evaluating the performance of a LLM.
Args:
@@ -177,7 +177,7 @@ def __init__(self, llm_evaluator: Generator):
def compute_judgement_single_question(
self, question: str, pred_answer: str, gt_answer: str, judgement_query: str
) -> bool:
"""
r"""
Get the judgement of the predicted answer for a single question.
Args:
question (str): Question string
@@ -204,7 +204,7 @@ def compute_judgement(
all_gt_answer: List[str],
judgement_query: str,
) -> List[bool]:
"""
r"""
Get the judgement of the predicted answer for a list of questions.
Args:
all_questions (List[str]): List of question strings
4 changes: 2 additions & 2 deletions use_cases/rag_hotpotqa.py
@@ -10,7 +10,7 @@

from core.string_parser import JsonParser
from core.component import Sequential
from eval.evaluator import (
from eval.evaluators import (
RetrieverEvaluator,
AnswerMacthEvaluator,
LLMasJudge,
@@ -111,7 +111,7 @@ def get_supporting_sentences(
)
print(f"Answer match accuracy: {answer_match_acc}")
print(f"Match accuracy for each query: {match_acc_list}")
# Evaluate the generator using LLM as judge. We use GPT-4 as the judge here.
# Evaluate the generator using LLM as judge.
# The task description and the judgement query can be customized.
llm_evaluator = Generator(
model_client=OpenAIClient(),
