Pyserini is built on Python 3.10 (other versions might work, but YMMV).
See `pyproject.toml` for a detailed list of dependencies.
At a high level:
- Pyserini depends on Anserini, which is built on Lucene. PyJNIus is used to interact with the JVM. We depend on Java 21.
- We need PyTorch, 🤗 Transformers, and the ONNX Runtime for "neural stuff".
A `pip` installation will automatically pull in the major dependencies, hopefully without any issues 🤞:
```bash
pip install pyserini
```
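To spot-check that the major neural-stack dependencies actually came in, a quick version printout (a minimal sketch, assuming PyTorch, 🤗 Transformers, and the ONNX Runtime installed cleanly) looks like this:

```python
# Minimal sanity check of the "neural stuff" dependencies (assumes they installed cleanly).
import torch
import transformers
import onnxruntime

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("ONNX Runtime:", onnxruntime.__version__)
```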
The toolkit also has a number of optional dependencies:
```bash
pip install 'pyserini[optional]'
```
Notably, `faiss-cpu`, `lightgbm`, and `nmslib` are included in these optional dependencies.
Installation of these packages can be temperamental, which is why they are not included in the core dependencies.
It might be a good idea to install these yourself separately.
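Since these are the temperamental packages, a quick probe like the sketch below (purely illustrative; it just tries to import each one) reports which of them are actually available in your environment:

```python
# Report which of the optional (and often temperamental) packages are importable.
import importlib

for name in ["faiss", "lightgbm", "nmslib"]:
    try:
        module = importlib.import_module(name)
        print(f"{name}: OK ({getattr(module, '__version__', 'version unknown')})")
    except ImportError as err:
        print(f"{name}: missing ({err})")
```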
Below is a step-by-step Pyserini installation guide based on Python 3.10. We recommend using Anaconda and assume you have already installed it. The following instructions are up to date as of November 2024 and should work.
If you're on a Mac with an M-series (i.e., ARM) processor:
```bash
conda create -n pyserini python=3.10 -y
conda activate pyserini

# Inside the new environment...
conda install -c anaconda wget -y
conda install -c conda-forge openjdk=21 maven -y

# If you want the optional dependencies, otherwise skip
conda install -c conda-forge lightgbm -y
conda install -c anaconda nmslib -y
conda install -c pytorch faiss-cpu -y

# Good idea to always explicitly specify the latest version (substitute for X.Y.Z below),
# found here: https://pypi.org/project/pyserini/
pip install pyserini==X.Y.Z

# If you want the optional dependencies, otherwise skip; the temperamental packages are
# already installed at this point, so this should be smooth...
pip install 'pyserini[optional]==X.Y.Z'
```
If you're on an Intel-based Mac, adjust the recipe accordingly for `osx-64`.
❗ If you get `numpy` v2 vs. v1 issues, you might need to explicitly downgrade `numpy`:

```bash
pip install numpy==1.26.4
```
For more details, see facebookresearch/faiss#3526
Otherwise (e.g., on Linux), follow the recipe below:
```bash
conda create -n pyserini python=3.10 -y
conda activate pyserini

# Inside the new environment...
conda install -c conda-forge openjdk=21 maven -y

# If you want the optional dependencies, otherwise skip
conda install -c conda-forge lightgbm nmslib -y
conda install -c pytorch faiss-cpu -y

# Good idea to always explicitly specify the latest version (substitute for X.Y.Z below),
# found here: https://pypi.org/project/pyserini/
pip install pyserini==X.Y.Z

# If you want the optional dependencies, otherwise skip; the temperamental packages are
# already installed at this point, so this should be smooth...
pip install 'pyserini[optional]==X.Y.Z'
```
❗ If you get `numpy` v2 vs. v1 issues, you might need to explicitly downgrade `numpy`:

```bash
pip install numpy==1.26.4
```
For more details, see facebookresearch/faiss#3526
By this point, Pyserini should be installed. It's worthwhile to do a bit of sanity checking, per below.
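As a first quick check (a minimal sketch; `__version__` is exposed in recent releases), confirm that the package imports at all and reports the version you installed:

```python
# Quick smoke test: the package should import and report the version you installed.
import pyserini

print(pyserini.__version__)
```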
To confirm that bag-of-words retrieval is working correctly, you can run the BM25 baseline on the MS MARCO passage ranking task:
```bash
python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --index msmarco-v1-passage \
  --topics msmarco-passage-dev-subset \
  --output run.msmarco-v1-passage.bm25-default.dev.txt \
  --bm25 --output-format msmarco

python -m pyserini.eval.msmarco_passage_eval \
  msmarco-passage-dev-subset run.msmarco-v1-passage.bm25-default.dev.txt
```
Expected Results:
```
MRR @10: 0.18741227770955546
```
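The same prebuilt index can also be searched interactively from Python, assuming the `LuceneSearcher` class available in recent Pyserini releases; the query below is just an example:

```python
from pyserini.search.lucene import LuceneSearcher

# Downloads the prebuilt MS MARCO v1 passage index on first use (it's large).
searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
hits = searcher.search('what is a lobster roll?')

# Print the top-3 docids and BM25 scores.
for i in range(3):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
```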
To confirm that dense retrieval is working correctly with Lucene using an HNSW index:
```bash
python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 --dense --hnsw \
  --index msmarco-v1-passage.bge-base-en-v1.5.hnsw \
  --topics dl19-passage \
  --onnx-encoder BgeBaseEn15 \
  --output run.msmarco-v1-passage.bge-base-en-v1.5.lucene-hnsw.onnx.dl19.txt \
  --hits 1000 --ef-search 1000

python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage \
  run.msmarco-v1-passage.bge-base-en-v1.5.lucene-hnsw.onnx.dl19.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage \
  run.msmarco-v1-passage.bge-base-en-v1.5.lucene-hnsw.onnx.dl19.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage \
  run.msmarco-v1-passage.bge-base-en-v1.5.lucene-hnsw.onnx.dl19.txt
```
Expected Results:
```
map                     all     0.4486
ndcg_cut_10             all     0.7016
recall_1000             all     0.8441
```
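If your Pyserini version also exposes a `LuceneHnswDenseSearcher` class (a sketch under that assumption; otherwise stick with the command-line invocation above), the same HNSW index can be queried from Python:

```python
from pyserini.search.lucene import LuceneHnswDenseSearcher

# Assumes the prebuilt HNSW index and ONNX encoder names used in the command above.
searcher = LuceneHnswDenseSearcher.from_prebuilt_index(
    'msmarco-v1-passage.bge-base-en-v1.5.hnsw',
    ef_search=1000,
    encoder='BgeBaseEn15')
hits = searcher.search('what is a lobster roll?', 10)

# Print the top-3 docids and scores.
for i in range(3):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
```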
To confirm that dense retrieval is working correctly with Faiss, run our TCT-ColBERT (v2) model on the MS MARCO passage ranking task:
```bash
python -m pyserini.search.faiss \
  --topics msmarco-passage-dev-subset \
  --index msmarco-v1-passage.tct_colbert-v2-hnp \
  --encoded-queries tct_colbert-v2-hnp-msmarco-passage-dev-subset \
  --threads 12 --batch-size 384 \
  --output run.msmarco-passage.tct_colbert-v2.bf.tsv \
  --output-format msmarco

python -m pyserini.eval.msmarco_passage_eval \
  msmarco-passage-dev-subset run.msmarco-passage.tct_colbert-v2.bf.tsv
```
Expected Results:
```
#####################
MRR @10: 0.3584
QueriesRanked: 6980
#####################
```
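The same prebuilt Faiss index can be queried interactively as well; below is a minimal sketch using the `FaissSearcher` API, where the Hugging Face model name for the query encoder is our assumption for the v2-HNP checkpoint:

```python
from pyserini.search.faiss import FaissSearcher, TctColBertQueryEncoder

# Encode queries on the fly instead of using the pre-encoded queries from the command above.
# The Hugging Face model name below is an assumption; adjust if your checkpoint differs.
encoder = TctColBertQueryEncoder('castorini/tct_colbert-v2-hnp-msmarco')
searcher = FaissSearcher.from_prebuilt_index('msmarco-v1-passage.tct_colbert-v2-hnp', encoder)
hits = searcher.search('what is a lobster roll?')

# Print the top-3 docids and scores.
for i in range(3):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
```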
If you've gotten this far, then everything should be working properly.
If you're planning on just using Pyserini, then the instructions above are fine. However, if you're planning on contributing to the codebase or want to work with the latest not-yet-released features, you'll need a development installation.
Start the same way as the installation above, but don't run `pip install pyserini`.
Instead, clone the Pyserini repo with the `--recurse-submodules` option to make sure the `tools/` submodule also gets cloned:

```bash
git clone git@github.com:castorini/pyserini.git --recurse-submodules
```
The `tools/` directory, which contains evaluation tools and scripts, is actually a separate repo, integrated as a Git submodule (so that it can be shared across related projects).
Change into the `pyserini` subdirectory and build as follows (you might get warnings, but they're okay to ignore):
```bash
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..
```
Then, in the `pyserini` clone, use `pip` to add an "editable" installation, as follows:
```bash
pip install -e .
```
You'll need to download the spaCy English model to reproduce tasks such as LTR Filtering for MS MARCO Passage:
```bash
python -m spacy download en_core_web_sm
```
Next, you'll need to clone and build Anserini.
It makes sense to put both `pyserini/` and `anserini/` in a common folder.
After you've successfully built Anserini, copy the fatjar, which will be at `target/anserini-X.Y.Z-SNAPSHOT-fatjar.jar`, into `pyserini/resources/jars/`.
As with the `pip` installation, a potential source of frustration is incompatibility among different versions of underlying dependencies.
You can confirm everything is working by running the unit tests:
```bash
python -m unittest
```
Assuming all tests pass, you should be ready to go!
- The above guide handles JVM installation via conda. If you are using your own Java environment and get an error about a Java version mismatch, it's likely an issue with your `JAVA_HOME` environment variable. In `bash`, use `echo $JAVA_HOME` to find out what the variable is currently set to, and use `export JAVA_HOME=/path/to/java/home` to change it to the correct path. On a Linux system, the correct path might look something like `/usr/lib/jvm/java-21`. Unfortunately, we are unable to offer more concrete advice since the actual path depends on your OS, which JDK you're using, and a host of other factors. A quick way to check is sketched right after this list.
- On Apple's M-series processors, make sure you've installed the ARM-based release of Conda instead of the Intel-based release.
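Here's a quick way to see which Java your environment actually resolves (a minimal sketch using only the Python standard library):

```python
import os
import shutil
import subprocess

# Show what JAVA_HOME points to and which `java` binary is on the PATH.
print("JAVA_HOME =", os.environ.get("JAVA_HOME", "(not set)"))
java_path = shutil.which("java")
print("java on PATH =", java_path)

if java_path:
    # `java -version` prints to stderr; this should report a 21.x runtime.
    subprocess.run([java_path, "-version"])
```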
At the University of Waterloo, we have two (CPU) development servers, `tuna` and `orca`.
Note that on these two servers, the root disk (where your home directory is mounted) doesn't have much space, so you need to point the Pyserini cache at scratch space:
- For `tuna`, create the directory `/tuna1/scratch/{username}`.
- For `orca`, create the directory `/store/scratch/{username}`.
Then, set the `PYSERINI_CACHE` environment variable to point to the directory you created above.
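The usual approach is to add `export PYSERINI_CACHE=/path/to/scratch` to your shell profile; the sketch below shows an equivalent per-script override in Python, using the `orca` layout from above with a placeholder username:

```python
import os

# Point the Pyserini cache at scratch space *before* Pyserini downloads anything.
# Substitute your own username in the path below.
os.environ["PYSERINI_CACHE"] = "/store/scratch/{username}"

from pyserini.search.lucene import LuceneSearcher

# Prebuilt indexes fetched from here on will be cached under the scratch directory.
searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
```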