BioChatter (https://github.com/biocypher/biochatter) is a Python framework that provides an easy-to-use interface to interact with LLMs and auxiliary technologies via an intuitive API (application programming interface). This way, its functionality can be integrated into any number of user interfaces, such as web apps, command-line interfaces, or Jupyter notebooks (Figure @fig:architecture).
The framework is designed to be modular: any of its components can be exchanged with other implementations (Figure @fig:overview). These functionalities include:
-
basic question-answering with LLMs hosted by providers (such as OpenAI) as well as locally deployed open-source models
-
reproducible prompt engineering to guide the LLM towards a specific task or behaviour
-
knowledge graph (KG) querying with automatic integration of any KG created in the BioCypher framework [@biocypher]
-
retrieval-augmented generation (RAG) using vector database embeddings of user-provided literature
-
model chaining to orchestrate multiple LLMs and other models in a single conversation using the LangChain framework [@langchain]
-
fact-checking of LLM responses using a second LLM
-
benchmarking of LLMs, prompts, and other components
In the following, we briefly describe these components, which are demonstrated in our web apps (https://chat.biocypher.org).
The core functionality of BioChatter is to interact with LLMs. The framework supports both leading proprietary models such as the GPT series from OpenAI as well as open-source models such as LLaMA2 [@doi:10.48550/arXiv.2307.09288] and Mixtral 8x7B [@doi:10.48550/arXiv.2401.04088] via a flexible open-source deployment framework [@{https://github.com/xorbitsai/inference}] (see Methods). Currently, the most powerful conversational AI platform, ChatGPT (OpenAI), is surrounded by data privacy concerns [@{https://www.reuters.com/technology/european-data-protection-board-discussing-ai-policy-thursday-meeting-2023-04-13/}]. To address this issue, we provide access to the different OpenAI models through their API, which is subject to different, more stringent data protection than the web interface [@{https://openai.com/policies/terms-of-use}], most importantly by disallowing reuse of user inputs for subsequent model training. Further, we aim to preferentially support open-source LLMs to facilitate more transparency in their application and increase data privacy by being able to run a model locally on dedicated hardware and end-user devices [@doi:10.1038/d41586-023-01295-4]. By building on LangChain [@langchain], we support dozens of LLM providers, such as the Xorbits Inference and Hugging Face APIs [@{https://github.com/xorbitsai/inference}], which can be used to query any of the more than 100 000 open-source models on Hugging Face Hub [@{https://huggingface.co/docs/hub/index}], for instance those on its LLM leaderboard [@{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard}]. Although OpenAI’s models currently outperform any alternatives in terms of both LLM performance and API convenience, we expect many open-source developments in this area in the future [@biollmbench]. Therefore, we support plug-and-play exchange of models to enhance biomedical AI readiness, and we implement a bespoke benchmarking framework for the biomedical application of LLMs.
An essential property of LLMs is their sensitivity to the prompt, i.e., the initial input that guides the model towards a specific task or behaviour. Prompt engineering is an emerging discipline of practical AI, and as such, there are no established best practices [@doi:10.48550/arXiv.2302.11382;@doi:10.48550/arXiv.2312.16171]. Current approaches are mostly trial-and-error-based manual engineering, which is not reproducible and changes with every new model [@biollmbench]. To address this issue, we include a prompt engineering framework in BioChatter that allows the preservation of prompt sets for specific tasks, which can be shared and reused by the community. In addition, to facilitate the scaling of prompt engineering, we integrate this framework into the benchmarking pipeline, which enables the automated evaluation of prompt sets as new models are published.
KGs are a powerful tool to represent and query knowledge in a structured manner. With BioCypher [@biocypher], we have developed a framework to create KGs from biomedical data in a user-friendly way while also semantically grounding the data in ontologies. BioChatter is an extension of the BioCypher ecosystem, elevating its user-friendliness further by allowing natural language interactions with the data; any BioCypher KG is automatically compatible with BioChatter. We use information generated in the build process of BioCypher KGs to tune BioChatter's understanding of the data structures and contents, thereby increasing the efficiency of LLM-based KG querying (see Methods). In addition, the ability to connect to any BioCypher KG allows the integration of prior knowledge into the LLM's retrieval, which can be used to ground the model's responses in the context of the KG via in-context learning / retrieval-augmented generation, which can facilitate human-AI interaction via symbolic concepts [@doi:10.1609/aaai.v36i11.21488]. We demonstrate the user experience of KG-driven interaction in [Supplementary Note 1: Knowledge Graph Retrieval-Augmented Generation] and on our website (https://biochatter.org/vignette-kg/).
LLM confabulation is a major issue for biomedical applications, where the consequences of incorrect information can be severe. One popular way of addressing this issue is to apply "in-context learning," which is also more recently referred to as "retrieval-augmented generation" (RAG) [@doi:10.48550/arxiv.2303.17580]. Briefly, RAG relies on injection of information into the model prompt of a pre-trained model and, as such, does not require retraining / fine-tuning; once created, any RAG prompt can be used with any LLM. While this can be done by processing structured knowledge, for instance, from KGs, it is often more efficient to use a semantic search engine to retrieve relevant information from unstructured data sources such as literature. By incorporating the management and integration of vector databases in the BioChatter framework, we allow the user to connect to a vector database, embed an arbitrary number of documents, and then use semantic search to improve the model prompts by adding text fragments relevant to the given question (see Methods). We demonstrate the user experience of RAG in [Supplementary Note 2: Retrieval-Augmented Generation] and on our website (https://biochatter.org/vignette-rag/).
LLMs cannot only seamlessly interact with human users, but also with other LLMs as well as many other types of models. They understand API calls and can therefore theoretically orchestrate complex multi-step tasks [@doi:10.48550/arXiv.2305.15334;@doi:10.48550/arXiv.2308.11432]. However, implementation is not trivial and the complex process can lead to unpredictable behaviours. We aim to improve the stability of model chaining in biomedical applications by developing bespoke approaches for common biomedical tasks, such as interpretation and design of experiments, evaluating literature, and exploring web resources. While we focus on reusing existing open-source frameworks such as LangChain [@langchain], we also develop bespoke solutions where necessary to provide stability for the given application. As an example, we implemented a fact-checking module that uses a second LLM to evaluate the factual correctness of the primary LLM's responses continuously during the conversation (see Methods).
The increasing generality of LLMs poses challenges for their comprehensive evaluation. Specifically, their ability to aid in a multitude of tasks and their great freedom in formatting the answers challenge their evaluation by traditional methods. To circumvent this issue, we focus on specific biomedical tasks and datasets and employ automated validation of the model's responses by a second LLM for advanced assessments. For transparent and reproducible evaluation of LLMs, we implement a benchmarking framework that allows the comparison of models, prompt sets, and all other components of the pipeline. The generic Pytest framework [@pytest] allows for the automated evaluation of a matrix of all possible combinations of components. The results are stored and displayed on our website for simple comparison, and the benchmark is updated upon the release of new models and extensions to the datasets and BioChatter capabilities (https://biochatter.org/benchmark/).
Since the biomedical domain has its own tasks and requirements [@biollmbench], we created a bespoke benchmark that allows us to be more precise in the evaluation of components. This is complementary to the existing, general-purpose benchmarks and leaderboards for LLMs [@doi:10.1038/s41586-023-06291-2;@{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard};@{https://crfm.stanford.edu/helm/lite/latest/}]. Furthermore, to prevent leakage of the benchmark data into the training data of the models, a known issue in the general-purpose benchmarks [@doi:10.48550/arXiv.2310.18018], we implemented an encrypted pipeline that contains the benchmark datasets and is only accessible to the workflow that executes the benchmark (see Methods).
Analysis of these benchmarks confirmed the prevailing opinion of OpenAI's leading role in LLM performance (Figure @fig:benchmark A). Since the benchmark datasets were created to specifically cover functions relevant in BioChatter's application domain, the benchmark results are primarily a measure of the LLMs' usefulness in our applications. OpenAI's GPT models (gpt-4 and gpt-3.5-turbo) lead by some margin on overall performance and consistency, but several open-source models reach high performance in specific tasks. Of note, while the newer version (0125) of gpt-3.5-turbo outperforms the previous version (0613) of gpt-4, version 0125 of gpt-4 shows a significant drop in performance. The performance of open-source models appears to depend on their quantisation level, i.e., the bit-precision used to represent the model's parameters. For models that offer quantisation options, performance apparently plateaus or even decreases after the 4- or 5-bit mark (Figure @fig:benchmark A). There is no apparent correlation between model size and performance (Pearson's r = 0.171, p < 0.001).
To evaluate the benefit of BioChatter functionality, we compared the performance of models with and without the use of BioChatter's prompt engine for KG querying. The models without prompt engine still have access to the BioCypher schema definition, which details the KG structure, but they do not use the multi-step procedure available through BioChatter. Consequently, the models without prompt engine show a lower performance in creating correct queries than the same models with prompt engine (0.444±0.11 vs. 0.818±0.11, unpaired t-test P < 0.001, Figure @fig:benchmark B).