diff --git a/docs/source/apis/eval/evaluators.rst b/docs/source/apis/eval/evaluators.rst new file mode 100644 index 000000000..de2c9304d --- /dev/null +++ b/docs/source/apis/eval/evaluators.rst @@ -0,0 +1,12 @@ +.. _evaluators: + +LightRAG.eval +========================= + +eval.evaluators +------------------------------------ + +.. automodule:: eval.evaluators + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/get_started/installation.rst b/docs/source/get_started/installation.rst index 21697c4c3..f1f021cde 100644 --- a/docs/source/get_started/installation.rst +++ b/docs/source/get_started/installation.rst @@ -11,4 +11,4 @@ To start with LightRAG, please follow the steps: 4. (For contributors only) Install pre-commit into your git hooks using ``pre-commit install``, which will automatically check the code standard on every commit. -5. Now you should run any file in the repo. \ No newline at end of file +5. Now you should be able to run any file in the repo. diff --git a/docs/source/get_started/introduction.rst b/docs/source/get_started/introduction.rst index b906c3162..4da292241 100644 --- a/docs/source/get_started/introduction.rst +++ b/docs/source/get_started/introduction.rst @@ -15,7 +15,7 @@ What is LightRAG? LightRAG comes from the best of the AI research and engineering. Fundamentally, we ask ourselves: what kind of system that combines the -best of research(such as LLM), engineering (such as ‘jinja’) to build +best of research (such as LLM), engineering (such as ‘jinja’) to build the best applications? We are not a framework. We do not want you to directly install the package. We want you to carefully decide to take modules and structures from here to build your own library and diff --git a/docs/source/index.rst b/docs/source/index.rst index 21b903c6a..b24b2f722 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -30,6 +30,8 @@ LightRAG comes from the best of the AI research and engineering. 
Fundamentally, :caption: Tutorials tutorials/simpleQA + tutorials/eval_and_metrics + tutorials/in_context_learning .. toctree:: :maxdepth: 1 @@ -37,12 +39,12 @@ LightRAG comes from the best of the AI research and engineering. Fundamentally, apis/components/components apis/core/core + apis/eval/evaluators .. toctree:: :glob: :maxdepth: 1 :caption: Resources - + resources/resources resources/contributing - diff --git a/docs/source/resources/contributing.rst b/docs/source/resources/contributing.rst index 3b79095e5..c39b0623c 100644 --- a/docs/source/resources/contributing.rst +++ b/docs/source/resources/contributing.rst @@ -44,15 +44,15 @@ To effectively edit the LightRAG documentation, you have several options dependi Locate the ``.rst`` file you want to edit within the ``docs/source`` directory. You can modify the content directly in any text editor. We are using `reStructuredText `_ as the language. For formatting help, refer to the reStructuredText Quickstart Guide: -- `Quickstart ` -- `reStructuredText Markup Specification ` +- `Quickstart `_ +- `reStructuredText Markup Specification `_ **Create a New .rst File** If you need to add a new section or topic: - Create a new ``.rst`` file in the appropriate subdirectory within ``docs/source``. -- Write your content following `reStructuredText syntax `. +- Write your content following `reStructuredText syntax `_. - If you are creating a new section, ensure to include your new file in the relevant ``toctree`` located usually in ``index.rst`` or within the closest parent ``.rst`` file, to make it appear in the compiled documentation. **Convert a Markdown File to .rst Using Pandoc** diff --git a/docs/source/tutorials/eval_and_metrics.rst b/docs/source/tutorials/eval_and_metrics.rst new file mode 100644 index 000000000..2e36385c2 --- /dev/null +++ b/docs/source/tutorials/eval_and_metrics.rst @@ -0,0 +1,25 @@ +Evaluation and Metrics +====================== + +Evaluating an LLM application essentially involves various metric functions. 
You can write your own metric functions or import from other libraries. In LightRAG, we provide a set of metrics in :ref:`our evaluators `. In this tutorial, we will show how to use them to evaluate the performance of the retriever and generator components of a RAG pipeline. + + +Evaluating a RAG Pipeline +--------------------------------------- +The full code for this tutorial can be found in `use_cases/rag_hotpotqa.py `_. + +RAG (Retrieval-Augmented Generation) pipelines leverage a retriever to fetch relevant context from a knowledge base (e.g., a document database), which is then fed to an LLM generator with the query to produce the answer. This allows the model to generate more contextually relevant answers. + +Thus, to evaluate a RAG pipeline, we can assess both the quality of the retrieved context and the quality of the final generated answer. Specifically, we can use the following evaluators and their corresponding metrics. + +* :class:`RetrieverEvaluator `: This evaluator is used to evaluate the performance of the retriever component of the RAG pipeline. It has the following metric functions: + * :obj:`compute_recall`: This function computes the recall of the retriever. It is defined as the number of relevant strings retrieved by the retriever divided by the total number of relevant strings in the knowledge base. + * :obj:`compute_context_relevance`: This function computes the relevance of the retrieved context. It is defined as the ratio of the number of relevant context tokens in the retrieved context to the total number of tokens in the retrieved context. +* :class:`AnswerMacthEvaluator `: This evaluator is used to evaluate the performance of the generator component of the RAG pipeline. It has the following metric functions: + * :obj:`compute_match_acc (if type is 'exact_match')`: This function computes the exact match accuracy of the generated answer. 
It is defined as the number of generated answers that exactly match the ground truth answer divided by the total number of generated answers. + * :obj:`compute_match_acc (if type is 'fuzzy_match')`: This function computes the fuzzy match accuracy of the generated answer. It is defined as the number of generated answers that contain the ground truth answer divided by the total number of generated answers. +* :class:`LLMasJudge `: This evaluator uses an LLM to get the judgement of the predicted answer for a list of questions. The task description and the judgement query of the LLM judge can be customized. + * :obj:`compute_judgement`: This function computes the judgement of the predicted answer. It is defined as the number of generated answers that are judged as correct by the LLM divided by the total number of generated answers. + + +TODO diff --git a/docs/source/tutorials/in_context_learning.rst b/docs/source/tutorials/in_context_learning.rst new file mode 100644 index 000000000..ff3c3592a --- /dev/null +++ b/docs/source/tutorials/in_context_learning.rst @@ -0,0 +1,3 @@ +In-Context Learning +=================== +TODO diff --git a/eval/evaluator.py b/eval/evaluators.py similarity index 96% rename from eval/evaluator.py rename to eval/evaluators.py index 9118c5003..3b50baee5 100644 --- a/eval/evaluator.py +++ b/eval/evaluators.py @@ -23,7 +23,7 @@ class AnswerMacthEvaluator: - """ + r""" Evaluator for evaluating the match between predicted answer and ground truth answer. Args: type (str): Type of matching evaluation. Can be "exact_match" or "fuzzy_match". "exact_match" requires the predicted answer to be exactly the same as the ground truth answer. "fuzzy_match" requires the predicted answer to contain the ground truth answer. @@ -33,7 +33,7 @@ def __init__(self, type: str = "exact_match"): self.type = type def compute_match_acc_single_query(self, pred_answer: str, gt_answer: str) -> float: - """ + r""" Compute the match accuracy of the predicted answer for a single query. 
Args: pred_answer (str): Predicted answer string @@ -51,7 +51,7 @@ def compute_match_acc_single_query(self, pred_answer: str, gt_answer: str) -> fl def compute_match_acc( self, all_pred_answer: List[str], all_gt_answer: List[str] ) -> Tuple[float, List[float]]: - """ + r""" Compute the match accuracy of the predicted answer for a list of queries. Args: all_pred_answer (List[str]): List of predicted answer strings @@ -69,7 +69,7 @@ def compute_match_acc( class RetrieverEvaluator: - """ + r""" Evaluator for evaluating the performance of a retriever. """ @@ -79,7 +79,7 @@ def __init__(self): def compute_recall_single_query( self, retrieved_context: str, gt_context: Union[str, List[str]] ) -> float: - """ + r""" Compute the recall of the retrieved context for a single query. Args: retrieved_context (str): Retrieved context string @@ -100,8 +100,8 @@ def compute_recall( all_retrieved_context: List[str], all_gt_context: Union[List[str], List[List[str]]], ) -> Tuple[float, List[float]]: - """ - Compute the recall of the retrieved context for a list of queries. + r""" + Compute the recall of the retrieved context for a list of queries. The recall is the ratio of the number of relevant context strings in the retrieved context to the total number of relevant context strings. Args: all_retrieved_context (List[str]): List of retrieved context strings all_gt_context (Union[List[str], List[List[str]]]: List of ground truth context strings and each of them can be a string or a list of strings @@ -119,7 +119,7 @@ def compute_recall( def compute_context_relevance_single_query( self, retrieved_context: str, gt_context: Union[str, List[str]] ) -> float: - """ + r""" Compute the context relevance of the retrieved context for a single query. The context relevance is the ratio of the number of relevant context tokens in the retrieved context to the total number of tokens in the retrieved context. 
Args: retrieved_context (str): Retrieved context string @@ -141,7 +141,7 @@ def compute_context_relevance( all_retrieved_context: List[str], all_gt_context: Union[List[str], List[List[str]]], ) -> Tuple[float, List[float]]: - """ + r""" Compute the context relevance of the retrieved context for a list of queries. The context relevance is the ratio of the number of relevant context tokens in the retrieved context to the total number of tokens in the retrieved context. Args: all_retrieved_context (List[str]): List of retrieved context strings @@ -164,7 +164,7 @@ def compute_context_relevance( class LLMasJudge: - """ + r""" LLM as judge for evaluating the performance of a LLM. Args: @@ -177,7 +177,7 @@ def __init__(self, llm_evaluator: Generator): def compute_judgement_single_question( self, question: str, pred_answer: str, gt_answer: str, judgement_query: str ) -> bool: - """ + r""" Get the judgement of the predicted answer for a single question. Args: question (str): Question string @@ -204,7 +204,7 @@ def compute_judgement( all_gt_answer: List[str], judgement_query: str, ) -> List[bool]: - """ + r""" Get the judgement of the predicted answer for a list of questions. Args: all_questions (List[str]): List of question strings diff --git a/use_cases/rag_hotpotqa.py b/use_cases/rag_hotpotqa.py index 7fc43cd11..3cce62d91 100644 --- a/use_cases/rag_hotpotqa.py +++ b/use_cases/rag_hotpotqa.py @@ -10,7 +10,7 @@ from core.string_parser import JsonParser from core.component import Sequential -from eval.evaluator import ( +from eval.evaluators import ( RetrieverEvaluator, AnswerMacthEvaluator, LLMasJudge, @@ -111,7 +111,7 @@ def get_supporting_sentences( ) print(f"Answer match accuracy: {answer_match_acc}") print(f"Match accuracy for each query: {match_acc_list}") - # Evaluate the generator using LLM as judge. We use GPT-4 as the judge here. + # Evaluate the generator using LLM as judge. # The task description and the judgement query can be customized. 
llm_evaluator = Generator( model_client=OpenAIClient(),