Exercises for Chapter 6: "Setting up a reproducible project: question and answer pipeline"
- Reproducible workflow building blocks
- Setting up a Q&A System:
  - Case I: Q&A system for a small collection of texts
  - Case II: Q&A system for complex knowledge bases
- Incorporating observability and evaluation of results
- Optimizing performance through a feedback loop
To follow along and successfully implement the concepts discussed in this chapter, you need to set up a proper development environment. The main technical requirement is Python, which will be used to manage the dependencies and run the necessary code for the project. Specifically, you need:
Python 3.12 or Higher
The project is designed to work with Python 3.12, so ensure you have this version installed. If not, consider using a tool like pyenv to manage and install the correct Python version.
We will use just scripts and uv to manage dependencies efficiently. These tools allow us to handle virtual environments and package installation seamlessly. For more information, visit https://github.com/casey/just and https://astral.sh/blog/uv; the installation guide for just can be found at https://github.com/casey/just?tab=readme-ov-file#installation.
We also suggest using pyenv to manage Python versions (https://github.com/pyenv/pyenv); installation instructions can be found at https://github.com/pyenv/pyenv?tab=readme-ov-file#installation.
You can set Python 3.12 globally as follows:
$ pyenv global 3.12
- Virtual environment: We recommend creating a virtual environment to isolate the project dependencies and avoid conflicts with other Python installations.
- pipx: While optional, using pipx to manage command-line tools like uv can simplify dependency installation and management.
- Docker Desktop: We will be using a Docker image of Elasticsearch to populate our vector database. Install it from https://www.docker.com/products/docker-desktop/.
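For example, once pipx is installed, uv can be installed with a single command (assuming pipx is already on your PATH):
$ pipx install uv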
The project requires several Python libraries, which will be installed through the provided justfile. The main libraries include:
- Haystack for building the Q&A pipeline
- OpenAI for embedding generation
- Additional utilities like BeautifulSoup, dotenv, and more for pre-processing and environment management.
- Bonus: Bytewax for real-time processing
We will add these as part of the pyproject.toml file found here.
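As a rough illustration, the dependency section might look like the sketch below. The project name, version pins, and exact package names (such as haystack-ai and elasticsearch-haystack) are assumptions here, so defer to the pyproject.toml shipped with the chapter.

```toml
[project]
name = "qa-pipeline"            # illustrative project name
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "haystack-ai",              # Q&A pipeline framework (Haystack 2.x)
    "elasticsearch-haystack",   # Elasticsearch document store integration
    "openai",                   # embedding and generation API client
    "beautifulsoup4",           # HTML pre-processing
    "python-dotenv",            # .env environment management
    "bytewax",                  # bonus: real-time processing
]
```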
By ensuring these requirements are met, you'll be able to set up a reproducible and scalable environment for the Q&A pipeline project.
In this folder, we have prepared a justfile to ensure this process is as smooth as possible. Once you have installed all the requirements, run
just get-started
This will create a venvs folder. Activate the environment:
. venvs/dev/bin/activate
To add new dependencies, modify the dependencies in the pyproject.toml file, then run
just develop
to update.
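For reference, a minimal justfile implementing these two recipes could look like the following sketch; the venvs/dev path and the uv invocations are assumptions, so see the justfile in this folder for the real recipe bodies.

```
# Hypothetical recipes; the actual justfile in this folder may differ.
get-started:
    uv venv venvs/dev          # create the virtual environment
    just develop               # install the project dependencies

develop:
    uv pip install --python venvs/dev/bin/python -e .
```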
To create a kernel using the virtual environment, run
python -m ipykernel install --user --name=venv --display-name "NLP pipelines"
Ensure Docker is running, then run
docker compose up
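If you want to see what the compose file is doing, a single-node Elasticsearch service for local development typically looks like the sketch below; the image tag, port, and settings are assumptions, so use the compose file provided with the project.

```yaml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.1  # tag is an assumption
    ports:
      - "9200:9200"                    # REST API used by the document store
    environment:
      - discovery.type=single-node     # no cluster needed for local dev
      - xpack.security.enabled=false   # disable auth for local experiments only
```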
Ensure you have an OpenAI key and have stored it in a .env file under a variable name such as OPENAI_API_KEY.
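As a quick sanity check that the key is picked up, you can load it with python-dotenv as in this minimal sketch (the .env file is assumed to sit in the working directory):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from .env"
```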
The pipelines can be found here:
- Indexing pipeline with custom components to extract, chunk and embed information from a JSONL file
- Querying pipeline with prompt templating to retrieve information
Indexing pipeline
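As a sketch of what this indexing pipeline could look like with stock Haystack 2.x components (the JSONL path and its "text" field are assumptions, and the chapter's custom components are replaced here by built-ins):

```python
import json

from haystack import Document, Pipeline
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")

# Extract: read raw records from a JSONL file (path and field name are assumptions).
with open("data.jsonl") as f:
    docs = [Document(content=json.loads(line)["text"]) for line in f]

indexing = Pipeline()
indexing.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))  # chunk
indexing.add_component("embedder", OpenAIDocumentEmbedder())                             # embed
indexing.add_component("writer", DocumentWriter(document_store=document_store))          # store
indexing.connect("splitter.documents", "embedder.documents")
indexing.connect("embedder.documents", "writer.documents")

indexing.run({"splitter": {"documents": docs}})
```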
Query pipeline
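A corresponding query-pipeline sketch, with an illustrative prompt template and question (not the ones used in the chapter):

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")

# Illustrative prompt template; the chapter's template differs.
template = """Answer the question using only the context below.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

query = Pipeline()
query.add_component("embedder", OpenAITextEmbedder())
query.add_component("retriever", ElasticsearchEmbeddingRetriever(document_store=document_store))
query.add_component("prompt", PromptBuilder(template=template))
query.add_component("llm", OpenAIGenerator())
query.connect("embedder.embedding", "retriever.query_embedding")
query.connect("retriever.documents", "prompt.documents")
query.connect("prompt.prompt", "llm.prompt")

question = "How do I build an indexing pipeline?"
result = query.run({"embedder": {"text": question}, "prompt": {"question": question}})
print(result["llm"]["replies"][0])
```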
The notebook includes an indexing pipeline to extract, chunk, and embed information from the Haystack tutorials, as well as a querying pipeline with a specific role (software engineer) to build code from the tutorials.
We included two examples. You can get started with a simple example that logs token usage and the time taken to perform embedding.
This assumes you have created a Weights and Biases account and have an API key; ensure you store it in a .env file under a variable name such as WANDB_API_KEY.
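A minimal version of that first example might look like the sketch below; the W&B project name and the embedding model are assumptions.

```python
import os
import time

import wandb
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # expects OPENAI_API_KEY and WANDB_API_KEY in .env

wandb.login(key=os.getenv("WANDB_API_KEY"))
run = wandb.init(project="nlp-pipelines")  # project name is an assumption

client = OpenAI()
start = time.perf_counter()
response = client.embeddings.create(
    model="text-embedding-3-small",  # model choice is an assumption
    input="A sentence to embed.",
)
elapsed = time.perf_counter() - start

# Log embedding latency and token usage for this call.
run.log({
    "embedding_seconds": elapsed,
    "prompt_tokens": response.usage.prompt_tokens,
    "total_tokens": response.usage.total_tokens,
})
run.finish()
```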
The complete RAGAS query pipeline can be visualized as follows.
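In code, evaluating question/answer/context triples with RAGAS looks roughly like this sketch; the sample data is made up, the metric set is an assumption, and it expects OPENAI_API_KEY in the environment.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Toy evaluation sample; in practice these come from the query pipeline's outputs.
data = {
    "question": ["What does the indexing pipeline do?"],
    "answer": ["It extracts, chunks, and embeds documents into Elasticsearch."],
    "contexts": [[
        "The indexing pipeline extracts, chunks, and embeds information "
        "from a JSONL file into a vector database."
    ]],
}

result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)
```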