BioChatter (version 0.4.7 at the time of publication) is a Python library supporting Python 3.10-3.12, which we verify with a continuous integration pipeline on GitHub (https://github.com/biocypher/biochatter). We provide documentation at https://biochatter.org, including a tutorial and API reference. All packages are developed openly and according to modern standards of software development [@doi:10.1038/s41597-020-0486-7]; we use the permissive MIT licence to encourage downstream use and development. We include a code of conduct and contributor guidelines to ensure accessibility and inclusivity for all who are interested in contributing to the framework.
To demonstrate basic and advanced use cases of the framework, we provide two web apps, BioChatter Light and BioChatter Next.
BioChatter Light is a web app based on the Streamlit framework (version 1.31.1, https://streamlit.io), which is written in Python and can be deployed locally or on a server (https://github.com/biocypher/biochatter-light). The ease with which Streamlit allows the creation of interactive web apps in pure Python enables rapid iteration and agile development of new features, with the trade-off of limited customisation and scalability. This framework is suitable for rapid prototyping of bespoke solutions for specific use cases. For an up-to-date overview of the platform's current functionality, please visit the online preview.
BioChatter Next (https://github.com/biocypher/biochatter-next) is a modern web app with a server-client architecture, based on the open-source template of ChatGPT-Next-Web (https://github.com/ChatGPTNextWeb/ChatGPT-Next-Web). It is written in TypeScript and Python and uses Next.js (v13.4.9) for a sleek frontend and Flask (v3.0.0) as the backend. It demonstrates the use of BioChatter in a modern web app, including full customisation, scalability, and localisation in 18 languages. However, this comes at the cost of increased complexity and development time. To provide seamless integration of the BioChatter backend into existing frontend solutions, we provide the server implementation at https://github.com/biocypher/biochatter-server and as a Docker image in our Docker Hub organisation (https://hub.docker.com/repository/docker/biocypher/biochatter-server).
We invite all interested researchers to select the framework that best suits their needs, or use the BioChatter server or library in their existing solutions.
The benchmarking framework examines a matrix of component combinations using the parameterisation feature of Pytest [@pytest]. This implementation allows for the automated evaluation of all possible combinations of components, such as LLMs, prompts, and datasets. We performed the benchmarks on a MacBook Pro with an M3 Max chip with a 40-core GPU and 128 GB of RAM. By default, we ran each test five times to account for the stochastic nature of LLMs. We generally set the temperature to the lowest possible value for each model to decrease fluctuation.
The Pytest matrix uses a hash-based system to evaluate whether a model-dataset combination has been run before. Briefly, the hash is calculated from the dictionary representation of the test parameters, and the test is skipped if the combination of hash and model name is already present in the database. This hashing optimises for efficiency by only running modified or newly added tests (a minimal sketch of the parameterisation and skipping logic follows the list below). The individual dimensions of the matrix are:
- LLMs: Testing proprietary (OpenAI) and open-source models (commonly via the Xorbits Inference API and Hugging Face models) against the same set of tasks is the primary aim of our benchmarking framework. We facilitate the automation of testing by including a programmatic way of deploying open-source models.
- Prompts: Since model performance can depend dramatically on the prompts used, we evaluate this variability with a set of prompts for each task, with varying degrees of specificity and both fixed and variable components.
- Datasets: We test the various tasks using question-answer-style datasets for each task.
- Data processing: Some data processing steps can have a great impact on the downstream performance of LLMs. For instance, we test the conversion of numbers (which LLMs are notoriously bad at handling) to categorical text (e.g., low, medium, high).
- Model quantisations: We test a set of quantisations for each model (where available) to account for the trade-off between model size, inference speed, and performance.
- Model parameters: Where suitable, we test a set of parameters for each model, such as "temperature", which controls the randomness and thus the reproducibility of model responses.
- Integrations: We write dedicated tests for specific tasks that require integrations, for instance with knowledge graphs or vector databases.
- Stochasticity: To account for variability in model responses, we include a parameter to run each test multiple times and generate summary statistics.
- Sentiment and behaviour: To assess whether the models exhibit the desired behaviour patterns for each of the personas, we let a second LLM evaluate the responses based on a set of criteria, including professionalism and politeness.
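As an illustration of this matrix, the following is a minimal sketch of a parameterised benchmark with hash-based skipping. The model list, the `param_hash` helper, and the `already_run` lookup are illustrative placeholders under our assumptions and do not reproduce the exact implementation in the BioChatter benchmark module.

```python
import hashlib
import json

import pytest

# Illustrative parameter matrix; the real benchmark spans more dimensions
# (prompts, datasets, quantisations, integrations, etc.).
MODELS = ["gpt-3.5-turbo", "llama-2-chat"]
TEMPERATURES = [0.0]
N_ITERATIONS = 5  # repeat each test to account for stochasticity


def param_hash(params: dict) -> str:
    """Hash the dictionary representation of the test parameters."""
    return hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()


def already_run(model: str, digest: str) -> bool:
    """Placeholder for the lookup in the results database."""
    return False  # the real benchmark checks stored (model, hash) pairs


@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("temperature", TEMPERATURES)
@pytest.mark.parametrize("iteration", range(N_ITERATIONS))
def test_entity_extraction(model, temperature, iteration):
    params = {"model": model, "temperature": temperature, "task": "entity_extraction"}
    digest = param_hash(params)
    if already_run(model, digest):
        pytest.skip("combination of hash and model already benchmarked")
    # ... run the actual LLM call and score the response here ...
```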
The Pytest framework is implemented at https://github.com/biocypher/biochatter/blob/main/benchmark, and more information is available at https://biochatter.org/benchmarking. The benchmark is updated upon the release of new models and extensions to the datasets, and is continuously available at https://biochatter.org/benchmark. We will run the benchmark on new models and variants (including fine-tuned models) upon request from the community; requests can be made on GitHub using our issue template (https://github.com/biocypher/biochatter/issues/new/choose). The living benchmark process is inspired by test-driven development: test cases are created based on specific features or behaviours that are desired. When a model does not initially produce the optimal response, which is often the case, adjustments are made to various elements of the framework, including prompts or functions, to enhance the model's effectiveness. Monitoring the model's performance on these tests over time allows us to assess the framework's reliability and pinpoint areas that need improvement.
To prevent leakage of benchmarking data (and subsequent contamination of future LLMs), we implement an encryption routine on the benchmark datasets. The encryption is performed using a hybrid encryption scheme, where the data are encrypted with a symmetric key, which is in turn encrypted with an asymmetric key. The datasets are stored in a dedicated encrypted pipeline that is only accessible to the workflow that executes the benchmark. These processes are implemented at https://github.com/biocypher/llm-test-dataset and accessed from the benchmark procedure in BioChatter.
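The following sketch illustrates a hybrid encryption scheme of the kind described, using the `cryptography` package: the dataset is encrypted with a symmetric (Fernet) key, which is in turn encrypted with an RSA public key. This is a simplified illustration under our assumptions; the actual routine in the llm-test-dataset repository may differ in library choices and key handling.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Symmetric key encrypts the (potentially large) benchmark dataset.
symmetric_key = Fernet.generate_key()
ciphertext = Fernet(symmetric_key).encrypt(b"question: ...\nanswer: ...")

# Asymmetric key pair; only the private-key holder (the benchmark workflow)
# can recover the symmetric key and thus the data.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
encrypted_symmetric_key = private_key.public_key().encrypt(
    symmetric_key,
    padding.OAEP(
        mgf=padding.MGF1(algorithm=hashes.SHA256()),
        algorithm=hashes.SHA256(),
        label=None,
    ),
)

# Decryption reverses the two steps.
recovered_key = private_key.decrypt(
    encrypted_symmetric_key,
    padding.OAEP(
        mgf=padding.MGF1(algorithm=hashes.SHA256()),
        algorithm=hashes.SHA256(),
        label=None,
    ),
)
plaintext = Fernet(recovered_key).decrypt(ciphertext)
```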
We utilise the close connection between BioChatter and the BioCypher framework [@biocypher] to integrate knowledge graph (KG) queries into the BioChatter API. In the BioCypher KG creation, we use a configuration file to map KG contents to ontology terms, including information about each of the entities. For instance, we detail the properties of a node and the source and target classes of an edge. Additionally, during the KG build process, we enrich this information and save it to a YAML file and, optionally, directly to the KG. This information is used by BioChatter to tune its understanding of the KG, which allows the LLM to query the KG more efficiently.
By understanding the context of the KG, the exact contents, and the exact spelling of all identifiers and properties, we effectively support the LLM in generating correct queries.
BioChatter breaks the query generation process into multiple steps: recognising entities and relationships in the user's question, estimating the properties to be used in the query, and generating a syntactically correct query in the query language of the database, based on the results of the previous steps and the constraints given by the KG schema information.
This procedure is implemented in the prompts.py module.
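A minimal usage sketch of this stepwise query generation is shown below. The class and method names (`BioCypherPromptEngine`, `generate_query`) and the schema file path are assumptions based on the BioChatter documentation at the time of writing and should be checked against the current API reference.

```python
from biochatter.prompts import BioCypherPromptEngine

# The schema configuration produced during the BioCypher KG build informs
# the prompt engine about entities, relationships, and their properties.
prompt_engine = BioCypherPromptEngine(
    schema_config_or_info_path="schema_info.yaml",  # hypothetical path
)

# Internally, this runs the steps described above: entity/relationship
# selection, property selection, and query assembly.
query = prompt_engine.generate_query(
    question="Which genes are associated with Alzheimer's disease?",
    query_language="Cypher",
)
print(query)
```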
To evaluate the quality of this process, we dedicate a module in the benchmark to the query generation process with a range of questions and KG schemata.
To illustrate the usage of this feature, we provide a demonstration repository at https://github.com/biocypher/pole including a KG build procedure and an instance of BioChatter Light, which can be run using a single Docker Compose command.
The pole KG can also be used in conjunction with the BioChatter Next app by using the docker-compose.yaml file to build the application locally.
A demonstration of this use case is available in [Supplementary Note 1: Knowledge Graph Retrieval-Augmented Generation] and on our website (https://biochatter.org/vignette-kg/).
While current LLMs possess extensive internal general knowledge, they may not know how to prioritise very specific scientific results, or they may not have had access to some research articles in their training data (e.g., due to their recency or licensing issues). To bridge this gap, we can provide additional information from relevant publications to the model via the prompt. However, we frequently cannot add entire publications to the prompt, since the input length of current models is still restricted; we need to isolate the information that is specifically relevant to the question given by the user. To find this information, we perform a semantic similarity search between the user’s question and the contents of user-provided scientific articles (or other texts). The most efficient way to do this mapping is by using a vector database [@doi:10.48550/arxiv.2308.07107].
The contextual background information provided by the user (e.g., by uploading a scientific article of prior work related to the experiment to be interpreted) is split into pieces suitable to be digested by the LLM, which are individually embedded by the model. These embeddings (represented by vectors) are used to store the text fragments in a vector database; the storage as vectors allows fast and efficient retrieval of similar entities via the comparison of individual vectors. For example, the two sentences “Amyloid beta levels are associated with Alzheimer’s Disease stage.” and “One of the most important clinical markers of AD progression is the amount of deposited A-beta 42.” would be closely associated in a vector database (given the embedding model is of sufficient quality, i.e., similar to GPT-3 or better), while traditional text-based similarity metrics probably would not identify them as highly similar.
By comparing the user’s question to prior knowledge in the vector database, we can extract the relevant pieces of information from the entire background. Even better, we can first use an LLM to generate an answer to the user's question and then use this answer to query the vector database for relevant information. Regardless of whether the initial answer is correct, it is likely that the "fake answer" is more semantically similar to the relevant pieces of information than the user's question [@doi:10.48550/arXiv.2308.07107]. Semantic search results (for instance, single sentences directly related to the topic of the question) are then sufficiently small to be added to the prompt. In this way, the model can learn from additional context without the need for retraining or fine-tuning. This method is sometimes described as in-context learning [@doi:10.48550/arxiv.2303.17580] or retrieval-augmented generation [@rag].
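The sketch below illustrates the retrieval step with an embedding model and cosine similarity; the embedding model name and the use of an in-memory array (rather than a vector database) are illustrative assumptions. A preliminary "fake answer" generated by the LLM could be embedded in place of the raw question, as described above.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment


def embed(texts: list[str]) -> np.ndarray:
    """Embed text fragments; the model name is illustrative."""
    response = client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return np.array([item.embedding for item in response.data])


# Fragments of user-provided background literature (normally stored in a
# vector database such as Milvus).
fragments = [
    "Amyloid beta levels are associated with Alzheimer's Disease stage.",
    "One of the most important clinical markers of AD progression is the "
    "amount of deposited A-beta 42.",
    "Mitochondrial dynamics are regulated by fission and fusion events.",
]
fragment_vectors = embed(fragments)

# Embed the question and retrieve the most semantically similar fragments.
question = "Which molecular markers track Alzheimer's disease progression?"
query_vector = embed([question])[0]

similarities = fragment_vectors @ query_vector / (
    np.linalg.norm(fragment_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_k = np.argsort(similarities)[::-1][:2]
context = "\n".join(fragments[i] for i in top_k)
# `context` is then prepended to the prompt sent to the primary model.
```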
To provide access to this functionality in BioChatter, we implement classes for the connection to, and management of, vector database systems (in the vectorstore.py module), and for performing semantic search on the vector database and injecting the results into the prompt (in the vectorstore_agent.py module). An analogous implementation for KG retrieval is available in the database_agent.py module. Both retrieval mechanisms are integrated and provided to the BioChatter API via the rag_agent.py module.
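Conceptually, the rag_agent.py module offers a single entry point that dispatches to either retrieval backend. The following is a hypothetical sketch of such a dispatcher, intended only to illustrate the design; the class and method names do not reproduce the actual BioChatter implementation.

```python
from typing import Protocol


class Retriever(Protocol):
    """Common interface for retrieval backends (illustrative)."""

    def retrieve(self, query: str, n_results: int) -> list[str]: ...


class VectorstoreRetriever:
    """Semantic similarity search against a vector database (stub)."""

    def retrieve(self, query: str, n_results: int) -> list[str]:
        return []  # embed the query and return the most similar fragments


class KnowledgeGraphRetriever:
    """Query generation and execution against a BioCypher KG (stub)."""

    def retrieve(self, query: str, n_results: int) -> list[str]:
        return []  # generate a query and return the result rows


class RagDispatcher:
    """Unified retrieval interface, analogous in spirit to rag_agent.py."""

    def __init__(self, mode: str):
        self._backend: Retriever = (
            VectorstoreRetriever() if mode == "vectorstore"
            else KnowledgeGraphRetriever()
        )

    def get_context(self, question: str, n_results: int = 3) -> str:
        return "\n".join(self._backend.retrieve(question, n_results))
```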
To demonstrate the use of the API, we add a “Retrieval-Augmented Generation” tab to the preview apps that allows the upload of text documents to be added to a vector database, which can then be queried to add contextual information to the prompt sent to the primary model.
This contextual information is transparently displayed.
Since this functionality requires a connection to a vector database system, we provide connectivity to a Milvus service, including a way to start the service in conjunction with a BioCypher knowledge graph and the BioChatter Light app in one Docker Compose workflow.
An example use case of this functionality is available in [Supplementary Note 2: Retrieval-Augmented Generation] and on our website (https://biochatter.org/vignette-rag/).
To facilitate access to open-source models, we adopt a flexible deployment framework based on the Xorbits Inference API [@{https://github.com/xorbitsai/inference}]. Xorbits Inference includes a large number of open-source models out of the box, and new models from the Hugging Face Hub [@{https://huggingface.co/}] can be added using the intuitive graphical user interface. We used Xorbits Inference version 0.8.4 to deploy the benchmarked models, and we provide a Docker Compose repository to deploy the app on a Linux server with Nvidia GPUs (https://github.com/biocypher/xinference-docker-builtin/). This Compose setup uses the multi-architecture image (for ARM64 and AMD64 chips) that we provide in our Docker Hub organisation (https://hub.docker.com/repository/docker/biocypher/xinference-builtin). On macOS with Apple Silicon chips, Docker does not have access to the GPU driver, and as such, Xinference needs to be deployed natively.
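As an example of programmatic deployment, the sketch below launches an open-source model on a running Xinference server and queries it through the server's OpenAI-compatible endpoint. The model name, format, quantisation, and endpoint URL are illustrative assumptions and should be adapted to the deployed Xinference version.

```python
from openai import OpenAI
from xinference.client import Client

# Launch an open-source model on a running Xinference server
# (model name, format, and quantisation are illustrative).
xinference = Client("http://localhost:9997")
model_uid = xinference.launch_model(
    model_name="llama-2-chat",
    model_format="ggufv2",
    model_size_in_billions=7,
    quantization="Q4_K_M",
)

# Xinference exposes an OpenAI-compatible endpoint that client
# applications can use to query the launched model.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")
response = client.chat.completions.create(
    model=model_uid,
    messages=[{"role": "user", "content": "What is the function of TP53?"}],
)
print(response.choices[0].message.content)
```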
The ability of LLMs to control external software, including other LLMs, opens up a wide range of possibilities for the orchestration of complex tasks. A simple example is the implementation of a correcting agent, which receives the output of the primary model and checks it for factual correctness. If the agent detects an error, it can prompt the primary model to correct its output, or forward this correction to the user directly. Since this relies on the internal knowledge base of the correcting agent, the same caveats apply, as the correcting agent may confabulate as well. However, since the agent is independent of the primary model (being set up with dedicated prompts), it is less likely to confabulate in the same way.
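The sketch below illustrates the correcting-agent pattern with two independent LLM calls. The prompts and model names are illustrative, and the sketch is a simplification rather than the BioChatter implementation.

```python
from openai import OpenAI

client = OpenAI()

CORRECTOR_SYSTEM_PROMPT = (
    "You are a fact-checker for biomedical statements. If the statement is "
    "correct, answer 'OK'. Otherwise, point out the error concisely."
)


def primary_answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    return response.choices[0].message.content


def correct(statement: str) -> str:
    # The correcting agent is independent of the primary model: it is set up
    # with its own dedicated system prompt (and could use a different model).
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": CORRECTOR_SYSTEM_PROMPT},
            {"role": "user", "content": statement},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


answer = primary_answer("Which pathway does the gene CDK1 primarily act in?")
verdict = correct(answer)
if verdict.strip() != "OK":
    # Forward the correction to the user or re-prompt the primary model.
    print(f"Correcting agent flagged the answer: {verdict}")
```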
This approach can be extended to a more complex model chain, where the correcting agent, for example, can query a knowledge graph or a vector database to ground its responses in prior knowledge. These chains are easy to implement, and some are available out of the box in the LangChain framework [@langchain]. However, they can behave unpredictably, and this unpredictability increases with the number of links in the chain; as such, chains should be tightly controlled. They also add to the computational burden of the system, which is particularly relevant for deployments on end-user devices.