Commit
Release 0.9.2.0
lintool committed May 15, 2020
1 parent b3b05b5 commit 4bcf54d
Showing 3 changed files with 12 additions and 21 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -12,7 +12,7 @@ A low-effort way to try out Pyserini is to look at our [online notebooks](https:
For convenience, we've pre-built a few common indexes, available to download [here](https://git.uwaterloo.ca/jimmylin/anserini-indexes).

Pyserini versions adopt the convention of _X.Y.Z.W_, where _X.Y.Z_ tracks the version of Anserini, and _W_ is used to distinguish different releases on the Python end.
The current stable release of Pyserini is [v0.9.1.0](https://pypi.org/project/pyserini/) on PyPI.
The current stable release of Pyserini is [v0.9.2.0](https://pypi.org/project/pyserini/) on PyPI.
The current experimental release of Pyserini on TestPyPI is behind the current stable release (i.e., do not use).
In general, documentation is kept up to date with the latest code in the repo.
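
The _X.Y.Z.W_ convention described above can be illustrated with a minimal sketch. The helper `split_version` and the `PyseriniVersion` tuple are hypothetical names introduced here for illustration only; they are not part of Pyserini.

```python
from typing import NamedTuple


class PyseriniVersion(NamedTuple):
    """Hypothetical container for the two halves of an X.Y.Z.W version string."""
    anserini: str  # the X.Y.Z part, which tracks the Anserini version
    release: int   # the W part, which distinguishes releases on the Python end


def split_version(version: str) -> PyseriniVersion:
    """Split an X.Y.Z.W string into its Anserini version and Python release number."""
    parts = version.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a version of the form X.Y.Z.W, got {version!r}")
    return PyseriniVersion(".".join(parts[:3]), int(parts[3]))


print(split_version("0.9.2.0"))
# PyseriniVersion(anserini='0.9.2', release=0)
```

Under this reading, the release in this commit pairs Anserini 0.9.2 with Python-side release number 0.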

@@ -23,7 +23,7 @@ If you're looking to work with the [COVID-19 Open Research Dataset (CORD-19)](ht
Install via PyPI

```
pip install pyserini==0.9.1.0
pip install pyserini==0.9.2.0
```

## Simple Usage
25 changes: 8 additions & 17 deletions docs/working-with-cord19.md
@@ -6,28 +6,19 @@ If you want to actually search the collection, consult [this guide](https://gith

## Data Prep

The latest distribution available is from 2020/05/01.
The latest distribution available is from 2020/05/12.
First, download the data:

```bash
DATE=2020-05-01
DATA_DIR=./cord19-"${DATE}"
DATE=2020-05-12
DATA_DIR=./collections/cord19-"${DATE}"
mkdir "${DATA_DIR}"

wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/comm_use_subset.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/noncomm_use_subset.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/custom_license.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/biorxiv_medrxiv.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/arxiv.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/document_parses.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/metadata.csv -P "${DATA_DIR}"

ls "${DATA_DIR}"/*.tar.gz | xargs -I {} tar -zxvf {} -C "${DATA_DIR}"
# If the above doesn't work due to cross-OS compatibility issues with xargs, untar all folders individually
# tar -zxvf "${DATA_DIR}"/comm_use_subset.tar.gz -C "${DATA_DIR}"
# tar -zxvf "${DATA_DIR}"/noncomm_use_subset.tar.gz -C "${DATA_DIR}"
# tar -zxvf "${DATA_DIR}"/custom_license.tar.gz -C "${DATA_DIR}"
# tar -zxvf "${DATA_DIR}"/biorxiv_medrxiv.tar.gz -C "${DATA_DIR}"
# tar -zxvf "${DATA_DIR}"/arxiv.tar.gz -C "${DATA_DIR}"
ls "${DATA_DIR}"/document_parses.tar.gz | xargs -I {} tar -zxvf {} -C "${DATA_DIR}"
rm "${DATA_DIR}"/document_parses.tar.gz
```

## Collection Access
@@ -37,7 +28,7 @@ The following snippet of code allows you to iterate through all articles in the
```python
from pyserini.collection import pycollection

collection = pycollection.Collection('Cord19AbstractCollection', 'cord19-2020-05-01')
collection = pycollection.Collection('Cord19AbstractCollection', 'collections/cord19-2020-05-12')

cnt = 0;
full_text = {True : 0, False: 0}
@@ -64,7 +55,7 @@ Let's examine the first full-text article in the collection:
from pyserini.collection import pycollection

# All this snippet of code does is to advance to the first full-text article:
collection = pycollection.Collection('Cord19AbstractCollection', 'cord19-2020-05-01')
collection = pycollection.Collection('Cord19AbstractCollection', 'collections/cord19-2020-05-12')

articles = collection.__next__()
article = None
4 changes: 2 additions & 2 deletions setup.py
@@ -5,14 +5,14 @@

setuptools.setup(
name="pyserini",
version="0.9.1.0",
version="0.9.2.0",
author="Jimmy Lin",
author_email="[email protected]",
description="Python interface to the Anserini IR toolkit built on Lucene",
long_description=long_description,
long_description_content_type="text/markdown",
package_data={"pyserini": [
"resources/jars/anserini-0.9.1-fatjar.jar",
"resources/jars/anserini-0.9.2-fatjar.jar",
]},
url="https://github.com/castorini/pyserini",
install_requires=['Cython', 'pyjnius'],
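
In this hunk the bundled jar name changes in step with the first three components of the package version. A release check along those lines might look like the sketch below. It assumes the jar path always follows the `resources/jars/anserini-X.Y.Z-fatjar.jar` pattern seen here, and `expected_fatjar` and `check_release` are hypothetical helpers, not part of `setup.py` or Pyserini.

```python
from typing import Dict, List


def expected_fatjar(version: str) -> str:
    """Derive the fatjar path from an X.Y.Z.W package version (assumed pattern)."""
    anserini_version = ".".join(version.split(".")[:3])
    return f"resources/jars/anserini-{anserini_version}-fatjar.jar"


def check_release(version: str, package_data: Dict[str, List[str]]) -> bool:
    """Return True if the jar matching `version` is listed under the 'pyserini' key."""
    return expected_fatjar(version) in package_data.get("pyserini", [])


# Values taken from the new side of this commit.
package_data = {"pyserini": ["resources/jars/anserini-0.9.2-fatjar.jar"]}
print(check_release("0.9.2.0", package_data))  # True
print(check_release("0.9.1.0", package_data))  # False
```

A check like this would flag a version bump that forgets to update the jar entry, or the reverse.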