Exercises for Chapter 6: "Setting up a reproducible project: question and answer pipeline"
- Reproducible workflow building blocks
- Setting up a Q&A System:
  - Case I: Q&A system for a small collection of texts
  - Case II: Q&A system for complex knowledge bases
- Incorporating observability and evaluation of results
- Optimizing performance through a feedback loop
To follow along and successfully implement the concepts discussed in this chapter, you need to set up a proper development environment. The main technical requirement is Python, which will be used to manage the dependencies and run the necessary code for the project. Specifically, you need:
Python 3.12 or Higher
The project is designed to work with Python 3.12, so ensure you have this version installed. If not, consider using a tool like pyenv to manage and install the correct Python version.
We will use just scripts and uv to manage dependencies efficiently. These tools allow us to handle virtual environments and package installation seamlessly. For more information, visit https://github.com/casey/just and https://astral.sh/blog/uv; the installation guide for just can be found at https://github.com/casey/just?tab=readme-ov-file#installation.
We also suggest using pyenv to manage Python versions (https://github.com/pyenv/pyenv); installation instructions can be found at https://github.com/pyenv/pyenv?tab=readme-ov-file#installation.
You can set Python 3.12 globally as follows:
$ pyenv global 3.12
- Virtual environment: We recommend creating a virtual environment to isolate the project dependencies and avoid conflicts with other Python installations.
- pipx: While optional, using pipx to manage command-line tools like uv can simplify dependency installation and management.
- Docker Desktop: We will be using a Docker image of Elasticsearch to populate our vector database. Install it from https://www.docker.com/products/docker-desktop/.
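For example, once pipx is installed, uv can be installed with a single command (assuming pipx is already on your PATH):
$ pipx install uv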
The project requires several Python libraries, which will be installed through the provided justfile. The main libraries include:
- Haystack for building the Q&A pipeline
- OpenAI for embedding generation
- Additional utilities like BeautifulSoup, dotenv, and more for pre-processing and environment management.
- Bonus: Bytewax for real-time processing
We will add these as part of the pyproject.toml file found here.
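As a rough illustration, the dependency section might look like the sketch below. The project name, version pins, and exact package names (such as haystack-ai and elasticsearch-haystack) are assumptions here, so defer to the pyproject.toml shipped with the chapter.

```toml
[project]
name = "qa-pipeline"            # illustrative project name
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "haystack-ai",              # Q&A pipeline framework (Haystack 2.x)
    "elasticsearch-haystack",   # Elasticsearch document store integration
    "openai",                   # embedding and generation API client
    "beautifulsoup4",           # HTML pre-processing
    "python-dotenv",            # .env environment management
    "bytewax",                  # bonus: real-time processing
]
```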
By ensuring these requirements are met, you'll be able to set up a reproducible and scalable environment for the Q&A pipeline project.
In this folder, we have prepared a justfile to ensure this process is as smooth as possible. Once you have installed all the requirements, run
just get-started
This will create a venvs folder. Activate the environment:
. venvs/dev/bin/activate
To add new dependencies, modify the dependencies in the pyproject.toml file, then run
just develop
to update.
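For reference, a minimal justfile implementing these two recipes could look like the following sketch; the venvs/dev path and the uv invocations are assumptions, so see the justfile in this folder for the real recipe bodies.

```
# Hypothetical recipes; the actual justfile in this folder may differ.
get-started:
    uv venv venvs/dev          # create the virtual environment
    just develop               # install the project dependencies

develop:
    uv pip install --python venvs/dev/bin/python -e .
```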
To create a kernel using the virtual environment, run
python -m ipykernel install --user --name=venv --display-name "NLP pipelines"
Ensure Docker is running, then run
docker compose up
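If you want to see what the compose file is doing, a single-node Elasticsearch service for local development typically looks like the sketch below; the image tag, port, and settings are assumptions, so use the compose file provided with the project.

```yaml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.1  # tag is an assumption
    ports:
      - "9200:9200"                    # REST API used by the document store
    environment:
      - discovery.type=single-node     # no cluster needed for local dev
      - xpack.security.enabled=false   # disable auth for local experiments only
```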
Ensure you have an OpenAI key and have stored it in a .env file under a variable name such as OPENAI_API_KEY.
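As a quick sanity check that the key is picked up, you can load it with python-dotenv as in this minimal sketch (the .env file is assumed to sit in the working directory):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from .env"
```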
The pipelines can be found here:
- Indexing pipeline with custom components to extract, chunk and embed information from a JSONL file
- Querying pipeline with prompt templating to retrieve information
Indexing pipeline
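As a sketch of what this indexing pipeline could look like with stock Haystack 2.x components (the JSONL path and its "text" field are assumptions, and the chapter's custom components are replaced here by built-ins):

```python
import json

from haystack import Document, Pipeline
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")

# Extract: read raw records from a JSONL file (path and field name are assumptions).
with open("data.jsonl") as f:
    docs = [Document(content=json.loads(line)["text"]) for line in f]

indexing = Pipeline()
indexing.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))  # chunk
indexing.add_component("embedder", OpenAIDocumentEmbedder())                             # embed
indexing.add_component("writer", DocumentWriter(document_store=document_store))          # store
indexing.connect("splitter.documents", "embedder.documents")
indexing.connect("embedder.documents", "writer.documents")

indexing.run({"splitter": {"documents": docs}})
```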
Query pipeline
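A corresponding query-pipeline sketch, with an illustrative prompt template and question (not the ones used in the chapter):

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")

# Illustrative prompt template; the chapter's template differs.
template = """Answer the question using only the context below.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

query = Pipeline()
query.add_component("embedder", OpenAITextEmbedder())
query.add_component("retriever", ElasticsearchEmbeddingRetriever(document_store=document_store))
query.add_component("prompt", PromptBuilder(template=template))
query.add_component("llm", OpenAIGenerator())
query.connect("embedder.embedding", "retriever.query_embedding")
query.connect("retriever.documents", "prompt.documents")
query.connect("prompt.prompt", "llm.prompt")

question = "How do I build an indexing pipeline?"
result = query.run({"embedder": {"text": question}, "prompt": {"question": question}})
print(result["llm"]["replies"][0])
```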
The notebook includes an indexing pipeline to extract, chunk, and embed information from the Haystack tutorials, as well as a querying pipeline with a specific role (software engineer) to build code from the tutorials.
We included two examples. You can get started with a simple example that logs token usage and the time taken to perform embedding.
This assumes you have created a Weights and Biases account and have an API key; ensure you store it in a .env file under a variable name such as WANDB_API_KEY.
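A minimal version of that first example might look like the sketch below; the W&B project name and the embedding model are assumptions.

```python
import os
import time

import wandb
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # expects OPENAI_API_KEY and WANDB_API_KEY in .env

wandb.login(key=os.getenv("WANDB_API_KEY"))
run = wandb.init(project="nlp-pipelines")  # project name is an assumption

client = OpenAI()
start = time.perf_counter()
response = client.embeddings.create(
    model="text-embedding-3-small",  # model choice is an assumption
    input="A sentence to embed.",
)
elapsed = time.perf_counter() - start

# Log embedding latency and token usage for this call.
run.log({
    "embedding_seconds": elapsed,
    "prompt_tokens": response.usage.prompt_tokens,
    "total_tokens": response.usage.total_tokens,
})
run.finish()
```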
The complete RAGAS query pipeline can be visualized as follows.
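In code, evaluating question/answer/context triples with RAGAS looks roughly like this sketch; the sample data is made up, the metric set is an assumption, and it expects OPENAI_API_KEY in the environment.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Toy evaluation sample; in practice these come from the query pipeline's outputs.
data = {
    "question": ["What does the indexing pipeline do?"],
    "answer": ["It extracts, chunks, and embeds documents into Elasticsearch."],
    "contexts": [[
        "The indexing pipeline extracts, chunks, and embeds information "
        "from a JSONL file into a vector database."
    ]],
}

result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)
```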