diff --git a/content/20.results.md b/content/20.results.md
index 4a932ca..18a624c 100644
--- a/content/20.results.md
+++ b/content/20.results.md
@@ -57,8 +57,8 @@ In addition, to facilitate the scaling of prompt engineering, we integrate this
 ### Benchmarking
 
 The increasing generality of LLMs poses challenges for their comprehensive evaluation.
-To circumvent this issue, we focus on specific biomedical tasks and datasets.
-For advanced assessment, we employ automated validation of the model's responses by a second LLM.
+Specifically, their ability to aid in a multitude of tasks and their great freedom in formatting the answers challenge their evaluation by traditional methods.
+To circumvent this issue, we focus on specific biomedical tasks and datasets and employ automated validation of the model's responses by a second LLM for advanced assessments.
 For transparent and reproducible evaluation of LLMs, we implement a benchmarking framework that allows the comparison of models, prompt sets, and all other components of the pipeline.
 The generic Pytest framework [@pytest] allows for the automated evaluation of a matrix of all possible combinations of components.
 The results are stored and displayed on our website for simple comparison, and the benchmark is updated upon the release of new models and extensions to the datasets and BioChatter capabilities ([https://biochatter.org/benchmark/](https://biochatter.org/benchmark/)).
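
To illustrate the matrix evaluation described in the patched text, the following is a minimal sketch of how Pytest parametrization can expand models and prompt sets into all pairwise combinations. The component names, the `run_task` helper, and the expected answer are hypothetical placeholders, not the BioChatter benchmark implementation.

```python
import pytest

# Hypothetical component lists; the real benchmark draws these from its configuration.
MODELS = ["model-a", "model-b"]
PROMPT_SETS = ["baseline", "refined"]


def run_task(model: str, prompt_set: str) -> str:
    """Placeholder for querying a model with a given prompt set."""
    return "expected answer"


# Stacked parametrize decorators make Pytest generate one test case per
# (model, prompt_set) combination, i.e. the full evaluation matrix.
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("prompt_set", PROMPT_SETS)
def test_benchmark_matrix(model: str, prompt_set: str):
    response = run_task(model, prompt_set)
    # In the described pipeline, a second LLM validates the response for
    # advanced assessments; here a simple string check stands in for that step.
    assert response == "expected answer"
```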