txt2solr.py

gramirez-prompsit edited this page May 31, 2021 · 2 revisions

txt2solr.py is a Python tool that reads TSV files and uploads their contents to a Solr instance.

Usage

usage: txt2solr.py [-h] -c COLLECTION -p PREFIX [-b BLOCKSIZE] [--liteformat]
                   [-u USER] [-w PASSWORD] [-q] [--debug] [--logfile LOGFILE]
                   input

positional arguments:
  input                 Corpus file (optionally gzipped).

optional arguments:
  -h, --help            show this help message and exit

options:
  -c COLLECTION, --collection COLLECTION
                        Solr collection url
  -p PREFIX, --prefix PREFIX
                        Prefix for Solr identifiers
  -b BLOCKSIZE, --blocksize BLOCKSIZE
                        Amount of documents to upload to Solr at once
  --liteformat          True when the TSV comes is 5 column long.
  -u USER, --user USER  Solr user
  -w PASSWORD, --password PASSWORD
                        Solr password

logging:
  -q, --quiet           Silent logging mode
  --debug               Debug logging mode
  --logfile LOGFILE     Store log to a file

For example:

python3.7 txt2solr.py -c http://localhost:20000/solr/paracrawl-en-es -p EN-ES --liteformat paracrawl.en-es.tsv -u solrusr -w solrpwd

When the --liteformat flag is used, the TSV file to be uploaded is expected to have the following five tab-separated columns, in this order:

SOURCE_URL TARGET_URL SOURCE_SENTENCE TARGET_SENTENCE SCORE
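
Before uploading, it can help to check that a corpus file really follows the five-column layout above. The following is a minimal sketch of such a check, not part of txt2solr.py: the names LITE_COLUMNS, open_corpus and check_lite_rows are hypothetical, and the assumption that SCORE is numeric is ours rather than something stated by the tool.

```python
import gzip

# Column layout described above for --liteformat input.
LITE_COLUMNS = [
    "SOURCE_URL",
    "TARGET_URL",
    "SOURCE_SENTENCE",
    "TARGET_SENTENCE",
    "SCORE",
]


def open_corpus(path):
    """Open a plain or gzipped corpus file as UTF-8 text (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def check_lite_rows(lines):
    """Yield (line_number, problem) for each row that does not fit the layout."""
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != len(LITE_COLUMNS):
            yield number, f"expected {len(LITE_COLUMNS)} columns, found {len(fields)}"
            continue
        # Assumption: SCORE holds a number; drop this check if it does not.
        try:
            float(fields[-1])
        except ValueError:
            yield number, f"SCORE is not a number: {fields[-1]!r}"
```

To check a file, open it with open_corpus("paracrawl.en-es.tsv") in a with statement and iterate over check_lite_rows(handle), printing each reported line number and problem. Lines are split on tabs directly rather than parsed with a CSV reader, so quote characters inside sentences are left untouched.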