Documentation for eval
mengliu1998 committed May 19, 2024
1 parent 19d089e commit 4f958ed
Showing 9 changed files with 63 additions and 21 deletions.
12 changes: 12 additions & 0 deletions docs/source/apis/eval/evaluators.rst
@@ -0,0 +1,12 @@
.. _evaluators:

LightRAG.eval
=========================

eval.evaluators
------------------------------------

.. automodule:: eval.evaluators
:members:
:undoc-members:
:show-inheritance:
2 changes: 1 addition & 1 deletion docs/source/get_started/installation.rst
@@ -11,4 +11,4 @@ To start with LightRAG, please follow the steps:

4. (For contributors only) Install pre-commit into your git hooks using ``pre-commit install``, which will automatically check the code standard on every commit.

5. Now you should run any file in the repo.
5. Now you should be able to run any file in the repo.
2 changes: 1 addition & 1 deletion docs/source/get_started/introduction.rst
@@ -15,7 +15,7 @@ What is LightRAG?

LightRAG comes from the best of the AI research and engineering.
Fundamentally, we ask ourselves: what kind of system that combines the
best of research(such as LLM), engineering (such as ‘jinja’) to build
best of research (such as LLM), engineering (such as ‘jinja’) to build
the best applications? We are not a framework. We do not want you to
directly install the package. We want you to carefully decide to take
modules and structures from here to build your own library and
6 changes: 4 additions & 2 deletions docs/source/index.rst
@@ -30,19 +30,21 @@ LightRAG comes from the best of the AI research and engineering. Fundamentally,
:caption: Tutorials

tutorials/simpleQA
tutorials/eval_and_metrics
tutorials/in_context_learning

.. toctree::
:maxdepth: 1
:caption: API Reference

apis/components/components
apis/core/core
apis/eval/evaluators

.. toctree::
:glob:
:maxdepth: 1
:caption: Resources

resources/resources
resources/contributing

6 changes: 3 additions & 3 deletions docs/source/resources/contributing.rst
@@ -44,15 +44,15 @@ To effectively edit the LightRAG documentation, you have several options dependi

Locate the ``.rst`` file you want to edit within the ``docs/source`` directory. You can modify the content directly in any text editor. We are using `reStructuredText <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html>`_ as the language. For formatting help, refer to the reStructuredText Quickstart Guide:

- `Quickstart <https://docutils.sourceforge.io/docs/user/rst/quickstart.html>`
- `reStructuredText Markup Specification <https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html>`
- `Quickstart <https://docutils.sourceforge.io/docs/user/rst/quickstart.html>`_
- `reStructuredText Markup Specification <https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html>`_

**Create a New .rst File**

If you need to add a new section or topic:

- Create a new ``.rst`` file in the appropriate subdirectory within ``docs/source``.
- Write your content following `reStructuredText syntax <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html>`.
- Write your content following `reStructuredText syntax <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html>`_.
- If you are creating a new section, ensure to include your new file in the relevant ``toctree`` located usually in ``index.rst`` or within the closest parent ``.rst`` file, to make it appear in the compiled documentation.

**Convert a Markdown File to .rst Using Pandoc**
25 changes: 25 additions & 0 deletions docs/source/tutorials/eval_and_metrics.rst
@@ -0,0 +1,25 @@
Evaluation and Metrics
======================

Evaluating an LLM application essentially comes down to applying various metric functions. You can write your own metric functions or import them from other libraries. In LightRAG, we provide a set of metrics in :ref:`our evaluators <evaluators>`. In this tutorial, we show how to use them to evaluate the performance of the retriever and generator components of a RAG pipeline.
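As a minimal sketch (assuming the snippet runs from the repository root so that the ``eval`` package is importable), the evaluators can be imported and instantiated as follows:

.. code-block:: python

    from eval.evaluators import (
        RetrieverEvaluator,
        AnswerMacthEvaluator,
        LLMasJudge,
    )

    # One evaluator per RAG component; LLMasJudge additionally needs an LLM
    # generator configured as the judge (see the full example script).
    retriever_evaluator = RetrieverEvaluator()
    answer_evaluator = AnswerMacthEvaluator(type="fuzzy_match")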


Evaluating a RAG Pipeline
---------------------------------------
The full code for this tutorial can be found in `use_cases/rag_hotpotqa.py <https://github.com/SylphAI-Inc/LightRAG/blob/main/use_cases/rag_hotpotqa.py>`_.

RAG (Retrieval-Augmented Generation) pipelines leverage a retriever to fetch relevant context from a knowledge base (e.g., a document database); the retrieved context is then fed, together with the query, to an LLM generator to produce the answer. This allows the model to generate more contextually relevant answers.
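As a rough sketch of that flow (the ``retrieve`` and ``generate`` helpers below are hypothetical placeholders for your own retriever and generator, not LightRAG APIs):

.. code-block:: python

    from typing import List

    # Hypothetical placeholders, for illustration only -- not LightRAG APIs.
    def retrieve(query: str) -> List[str]:
        """Return the context strings judged most relevant to the query."""
        return ["Paris is the capital and most populous city of France."]

    def generate(query: str, context: List[str]) -> str:
        """Ask the LLM to answer the query, grounded in the retrieved context."""
        return "Paris"

    query = "What is the capital of France?"
    retrieved_context = retrieve(query)               # retrieval step
    pred_answer = generate(query, retrieved_context)  # generation step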

Thus, to evaluate a RAG pipeline, we can assess both the quality of the retrieved context and the quality of the final generated answer. Specifically, we can use the following evaluators and their corresponding metrics; a short usage sketch follows the list.

* :class:`RetrieverEvaluator <eval.evaluators.RetrieverEvaluator>`: This evaluator assesses the performance of the retriever component of the RAG pipeline. It has the following metric functions:

  * :obj:`compute_recall`: This function computes the recall of the retriever. It is defined as the number of relevant strings retrieved by the retriever divided by the total number of relevant strings in the knowledge base.
  * :obj:`compute_context_relevance`: This function computes the relevance of the retrieved context. It is defined as the ratio of the number of relevant context tokens in the retrieved context to the total number of tokens in the retrieved context.

* :class:`AnswerMacthEvaluator <eval.evaluators.AnswerMacthEvaluator>`: This evaluator assesses the performance of the generator component of the RAG pipeline. It has the following metric functions:

  * :obj:`compute_match_acc` (if ``type`` is ``"exact_match"``): This function computes the exact-match accuracy of the generated answers. It is defined as the number of generated answers that exactly match the ground truth answer divided by the total number of generated answers.
  * :obj:`compute_match_acc` (if ``type`` is ``"fuzzy_match"``): This function computes the fuzzy-match accuracy of the generated answers. It is defined as the number of generated answers that contain the ground truth answer divided by the total number of generated answers.

* :class:`LLMasJudge <eval.evaluators.LLMasJudge>`: This evaluator uses an LLM to judge the predicted answers for a list of questions. The task description and the judgement query of the LLM judge can be customized.

  * :obj:`compute_judgement`: This function computes the judgement of the predicted answers. It is defined as the number of generated answers that are judged as correct by the LLM divided by the total number of generated answers.

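A minimal usage sketch of these metric functions is given below; the toy inputs are made up for illustration, and ``LLMasJudge`` is omitted because it additionally requires a ``Generator`` configured as the judge (see the full example script):

.. code-block:: python

    from eval.evaluators import RetrieverEvaluator, AnswerMacthEvaluator

    retrieved_context = ["Paris is the capital and most populous city of France."]
    gt_context = [["Paris is the capital of France."]]
    pred_answers = ["Paris"]
    gt_answers = ["Paris"]

    # Retriever metrics: each call returns the average plus the per-query list.
    retriever_evaluator = RetrieverEvaluator()
    avg_recall, recall_list = retriever_evaluator.compute_recall(
        all_retrieved_context=retrieved_context, all_gt_context=gt_context
    )
    avg_relevance, relevance_list = retriever_evaluator.compute_context_relevance(
        all_retrieved_context=retrieved_context, all_gt_context=gt_context
    )

    # Generator metric: exact or fuzzy match against the ground truth answers.
    answer_evaluator = AnswerMacthEvaluator(type="fuzzy_match")
    avg_acc, acc_list = answer_evaluator.compute_match_acc(
        all_pred_answer=pred_answers, all_gt_answer=gt_answers
    )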


TODO
3 changes: 3 additions & 0 deletions docs/source/tutorials/in_context_learning.rst
@@ -0,0 +1,3 @@
In-Context Learning
===================
TODO
24 changes: 12 additions & 12 deletions eval/evaluator.py → eval/evaluators.py
@@ -23,7 +23,7 @@


class AnswerMacthEvaluator:
"""
r"""
Evaluator for evaluating the match between predicted answer and ground truth answer.
Args:
type (str): Type of matching evaluation. Can be "exact_match" or "fuzzy_match". "exact_match" requires the predicted answer to be exactly the same as the ground truth answer. "fuzzy_match" requires the predicted answer to contain the ground truth answer.
@@ -33,7 +33,7 @@ def __init__(self, type: str = "exact_match"):
self.type = type

def compute_match_acc_single_query(self, pred_answer: str, gt_answer: str) -> float:
"""
r"""
Compute the match accuracy of the predicted answer for a single query.
Args:
pred_answer (str): Predicted answer string
@@ -51,7 +51,7 @@ def compute_match_acc_single_query(self, pred_answer: str, gt_answer: str) -> fl
def compute_match_acc(
self, all_pred_answer: List[str], all_gt_answer: List[str]
) -> Tuple[float, List[float]]:
"""
r"""
Compute the match accuracy of the predicted answer for a list of queries.
Args:
all_pred_answer (List[str]): List of predicted answer strings
@@ -69,7 +69,7 @@ def compute_match_acc(


class RetrieverEvaluator:
"""
r"""
Evaluator for evaluating the performance of a retriever.
"""

@@ -79,7 +79,7 @@ def __init__(self):
def compute_recall_single_query(
self, retrieved_context: str, gt_context: Union[str, List[str]]
) -> float:
"""
r"""
Compute the recall of the retrieved context for a single query.
Args:
retrieved_context (str): Retrieved context string
@@ -100,8 +100,8 @@ def compute_recall(
all_retrieved_context: List[str],
all_gt_context: Union[List[str], List[List[str]]],
) -> Tuple[float, List[float]]:
"""
Compute the recall of the retrieved context for a list of queries.
r"""
Compute the recall of the retrieved context for a list of queries. The recall is the ratio of the number of relevant context strings in the retrieved context to the total number of relevant context strings.
Args:
all_retrieved_context (List[str]): List of retrieved context strings
all_gt_context (Union[List[str], List[List[str]]]: List of ground truth context strings and each of them can be a string or a list of strings
@@ -119,7 +119,7 @@ def compute_recall(
def compute_context_relevance_single_query(
self, retrieved_context: str, gt_context: Union[str, List[str]]
) -> float:
"""
r"""
Compute the context relevance of the retrieved context for a single query. The context relevance is the ratio of the number of relevant context tokens in the retrieved context to the total number of tokens in the retrieved context.
Args:
retrieved_context (str): Retrieved context string
@@ -141,7 +141,7 @@ def compute_context_relevance(
all_retrieved_context: List[str],
all_gt_context: Union[List[str], List[List[str]]],
) -> Tuple[float, List[float]]:
"""
r"""
Compute the context relevance of the retrieved context for a list of queries. The context relevance is the ratio of the number of relevant context tokens in the retrieved context to the total number of tokens in the retrieved context.
Args:
all_retrieved_context (List[str]): List of retrieved context strings
@@ -164,7 +164,7 @@ def compute_context_relevance(


class LLMasJudge:
"""
r"""
LLM as judge for evaluating the performance of a LLM.
Args:
@@ -177,7 +177,7 @@ def __init__(self, llm_evaluator: Generator):
def compute_judgement_single_question(
self, question: str, pred_answer: str, gt_answer: str, judgement_query: str
) -> bool:
"""
r"""
Get the judgement of the predicted answer for a single question.
Args:
question (str): Question string
@@ -204,7 +204,7 @@ def compute_judgement(
all_gt_answer: List[str],
judgement_query: str,
) -> List[bool]:
"""
r"""
Get the judgement of the predicted answer for a list of questions.
Args:
all_questions (List[str]): List of question strings
4 changes: 2 additions & 2 deletions use_cases/rag_hotpotqa.py
@@ -10,7 +10,7 @@

from core.string_parser import JsonParser
from core.component import Sequential
from eval.evaluator import (
from eval.evaluators import (
RetrieverEvaluator,
AnswerMacthEvaluator,
LLMasJudge,
@@ -111,7 +111,7 @@ def get_supporting_sentences(
)
print(f"Answer match accuracy: {answer_match_acc}")
print(f"Match accuracy for each query: {match_acc_list}")
# Evaluate the generator using LLM as judge. We use GPT-4 as the judge here.
# Evaluate the generator using LLM as judge.
# The task description and the judgement query can be customized.
llm_evaluator = Generator(
model_client=OpenAIClient(),
