diff --git a/README.md b/README.md
index 89d7abdab..183e781eb 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ A low-effort way to try out Pyserini is to look at our [online notebooks](https:
 For convenience, we've pre-built a few common indexes, available to download [here](https://git.uwaterloo.ca/jimmylin/anserini-indexes).
 
 Pyserini versions adopt the convention of _X.Y.Z.W_, where _X.Y.Z_ tracks the version of Anserini, and _W_ is used to distinguish different releases on the Python end.
-The current stable release of Pyserini is [v0.9.1.0](https://pypi.org/project/pyserini/) on PyPI.
+The current stable release of Pyserini is [v0.9.2.0](https://pypi.org/project/pyserini/) on PyPI.
 The current experimental release of Pyserini on TestPyPI is behind the current stable release (i.e., do not use).
 In general, documentation is kept up to date with the latest code in the repo.
 
@@ -23,7 +23,7 @@ If you're looking to work with the [COVID-19 Open Research Dataset (CORD-19)](ht
 Install via PyPI
 
 ```
-pip install pyserini==0.9.1.0
+pip install pyserini==0.9.2.0
 ```
 
 ## Simple Usage
diff --git a/docs/working-with-cord19.md b/docs/working-with-cord19.md
index ef26532ef..d6fa161c9 100644
--- a/docs/working-with-cord19.md
+++ b/docs/working-with-cord19.md
@@ -6,28 +6,19 @@ If you want to actually search the collection, consult [this guide](https://gith
 
 ## Data Prep
 
-The latest distribution available is from 2020/05/01.
+The latest distribution available is from 2020/05/12.
 First, download the data:
 
 ```bash
-DATE=2020-05-01
-DATA_DIR=./cord19-"${DATE}"
+DATE=2020-05-12
+DATA_DIR=./collections/cord19-"${DATE}"
 mkdir "${DATA_DIR}"
 
-wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/comm_use_subset.tar.gz -P "${DATA_DIR}"
-wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/noncomm_use_subset.tar.gz -P "${DATA_DIR}"
-wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/custom_license.tar.gz -P "${DATA_DIR}"
-wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/biorxiv_medrxiv.tar.gz -P "${DATA_DIR}"
-wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/arxiv.tar.gz -P "${DATA_DIR}"
+wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/document_parses.tar.gz -P "${DATA_DIR}"
 wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/metadata.csv -P "${DATA_DIR}"
 
-ls "${DATA_DIR}"/*.tar.gz | xargs -I {} tar -zxvf {} -C "${DATA_DIR}"
-# If the above doesn't work due to cross-OS compatibility issues with xargs, untar all folders individually
-# tar -zxvf "${DATA_DIR}"/comm_use_subset.tar.gz -C "${DATA_DIR}"
-# tar -zxvf "${DATA_DIR}"/noncomm_use_subset.tar.gz -C "${DATA_DIR}"
-# tar -zxvf "${DATA_DIR}"/custom_license.tar.gz -C "${DATA_DIR}"
-# tar -zxvf "${DATA_DIR}"/biorxiv_medrxiv.tar.gz -C "${DATA_DIR}"
-# tar -zxvf "${DATA_DIR}"/arxiv.tar.gz -C "${DATA_DIR}"
+ls "${DATA_DIR}"/document_parses.tar.gz | xargs -I {} tar -zxvf {} -C "${DATA_DIR}"
+rm "${DATA_DIR}"/document_parses.tar.gz
 ```
 
 ## Collection Access
@@ -37,7 +28,7 @@ The following snippet of code allows you to iterate through all articles in the
 ```python
 from pyserini.collection import pycollection
 
-collection = pycollection.Collection('Cord19AbstractCollection', 'cord19-2020-05-01')
+collection = pycollection.Collection('Cord19AbstractCollection', 'collections/cord19-2020-05-12')
 
 cnt = 0;
 full_text = {True : 0, False: 0}
@@ -64,7 +55,7 @@ Let's examine the first full-text article in the collection:
 from pyserini.collection import pycollection
 
 # All this snippet of code does is to advance to the frist full-text article:
-collection = pycollection.Collection('Cord19AbstractCollection', 'cord19-2020-05-01')
+collection = pycollection.Collection('Cord19AbstractCollection', 'collections/cord19-2020-05-12')
 
 articles = collection.__next__()
 article = None
diff --git a/setup.py b/setup.py
index d9cf0f92c..9cdd7f3d3 100644
--- a/setup.py
+++ b/setup.py
@@ -5,14 +5,14 @@ setuptools.setup(
     name="pyserini",
-    version="0.9.1.0",
+    version="0.9.2.0",
     author="Jimmy Lin",
     author_email="jimmylin@uwaterloo.ca",
     description="Python interface to the Anserini IR toolkit built on Lucene",
     long_description=long_description,
     long_description_content_type="text/markdown",
     package_data={"pyserini": [
-        "resources/jars/anserini-0.9.1-fatjar.jar",
+        "resources/jars/anserini-0.9.2-fatjar.jar",
     ]},
     url="https://github.com/castorini/pyserini",
     install_requires=['Cython', 'pyjnius'],