Commit 0547deb: explain challenges briefly

slobentanzer committed Feb 14, 2024
1 parent fc9e021 commit 0547deb
Showing 1 changed file with 2 additions and 2 deletions (content/20.results.md)
@@ -57,8 +57,8 @@ In addition, to facilitate the scaling of prompt engineering, we integrate this
### Benchmarking

The increasing generality of LLMs poses challenges for their comprehensive evaluation.
-To circumvent this issue, we focus on specific biomedical tasks and datasets.
-For advanced assessment, we employ automated validation of the model's responses by a second LLM.
+Specifically, their ability to aid in a multitude of tasks and their great freedom in formatting the answers challenge their evaluation by traditional methods.
+To circumvent this issue, we focus on specific biomedical tasks and datasets and employ automated validation of the model's responses by a second LLM for advanced assessments.
For transparent and reproducible evaluation of LLMs, we implement a benchmarking framework that allows the comparison of models, prompt sets, and all other components of the pipeline.
The generic Pytest framework [@pytest] allows for the automated evaluation of a matrix of all possible combinations of components.
The results are stored and displayed on our website for simple comparison, and the benchmark is updated upon the release of new models and extensions to the datasets and BioChatter capabilities ([https://biochatter.org/benchmark/](https://biochatter.org/benchmark/)).
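For illustration, a minimal sketch of how such a matrix evaluation can be expressed with pytest parametrisation; the model names, the prompt set, and the `query_model` helper are hypothetical placeholders, not BioChatter's actual API:

```python
# Minimal sketch: stacked parametrize decorators expand the
# cross-product of components into one test case per combination.
import pytest

MODELS = ["gpt-3.5-turbo", "gpt-4"]   # hypothetical model list
PROMPTS = ["prompt_a", "prompt_b"]    # hypothetical prompt set


def query_model(model: str, prompt: str) -> str:
    # Placeholder for the real pipeline call (hypothetical helper).
    return f"{model}: response to {prompt}"


# 2 models x 2 prompts = 4 generated test cases.
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("prompt", PROMPTS)
def test_response_matrix(model, prompt):
    response = query_model(model, prompt)
    assert response  # a real benchmark would score the response
```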
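Similarly, the revised text describes automated validation of responses by a second LLM. A minimal sketch of that judge pattern, assuming a generic `chat` callable and hypothetical prompt wording rather than BioChatter's actual implementation:

```python
# Minimal sketch of second-LLM validation: one model answers,
# a second model judges the answer against the expected result.
def judge_response(chat, question: str, answer: str, expected: str) -> bool:
    """Ask a judge LLM whether the answer matches the expectation."""
    verdict = chat(
        f"Question: {question}\n"
        f"Expected: {expected}\n"
        f"Answer: {answer}\n"
        "Reply with exactly 'yes' if the answer matches the expectation, "
        "otherwise 'no'."
    )
    return verdict.strip().lower().startswith("yes")
```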
