txt2solr.py
gramirez-prompsit edited this page May 31, 2021
txt2solr.py is a Python tool that reads TSV files and uploads their contents to a Solr instance.
usage: txt2solr.py [-h] -c COLLECTION -p PREFIX [-b BLOCKSIZE] [--liteformat]
                   [-u USER] [-w PASSWORD] [-q] [--debug] [--logfile LOGFILE]
                   input

positional arguments:
  input                 Corpus file (optionally gzipped).

optional arguments:
  -h, --help            show this help message and exit

options:
  -c COLLECTION, --collection COLLECTION
                        Solr collection url
  -p PREFIX, --prefix PREFIX
                        Prefix for Solr identifiers
  -b BLOCKSIZE, --blocksize BLOCKSIZE
                        Amount of documents to upload to Solr at once
  --liteformat          True when the TSV comes is 5 column long.
  -u USER, --user USER  Solr user
  -w PASSWORD, --password PASSWORD
                        Solr password

logging:
  -q, --quiet           Silent logging mode
  --debug               Debug logging mode
  --logfile LOGFILE     Store log to a file
For example:
python3.7 txt2solr.py -c http://localhost:20000/solr/paracrawl-en-es -p EN-ES --liteformat paracrawl.en-es.tsv -u solrusr -w solrpwd
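To illustrate what "optionally gzipped" input and the -b BLOCKSIZE option imply, here is a minimal sketch of reading a corpus file and grouping its lines into blocks. It is not taken from txt2solr.py itself: the function names open_corpus and blocks are hypothetical, and detecting gzip by its magic bytes is an assumption about one reasonable way to do it.

```python
import gzip
from itertools import islice
from typing import IO, Iterable, Iterator, List


def open_corpus(path: str) -> IO[str]:
    """Open a corpus file as UTF-8 text, handling gzip transparently.

    Assumption: gzip files are recognised by their two-byte magic number,
    not by file extension.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def blocks(items: Iterable[str], blocksize: int) -> Iterator[List[str]]:
    """Yield consecutive lists holding at most `blocksize` items each."""
    if blocksize < 1:
        raise ValueError("blocksize must be >= 1")
    iterator = iter(items)
    while True:
        block = list(islice(iterator, blocksize))
        if not block:
            return
        yield block
```

With this shape, each yielded block would correspond to one upload request, so a larger block size means fewer requests at the cost of more memory per request.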
When the --liteformat flag is used, each line of the TSV to be uploaded is expected to contain these five tab-separated columns, in this order:
SOURCE_URL TARGET_URL SOURCE_SENTENCE TARGET_SENTENCE SCORE
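Before uploading, it can be useful to check that a file really follows this five-column layout. The following is a minimal sketch of such a check, not part of txt2solr.py: the function parse_lite_line and the lowercase field names are hypothetical (they mirror the column labels above, not the Solr schema), and treating SCORE as a floating-point number is an assumption.

```python
from typing import Dict, Union

# Illustrative field names that mirror the column labels above.
LITE_COLUMNS = (
    "source_url",
    "target_url",
    "source_sentence",
    "target_sentence",
    "score",
)


def parse_lite_line(line: str) -> Dict[str, Union[str, float]]:
    """Split one lite-format line into named fields.

    Raises ValueError if the line does not have exactly five columns
    or if the score is not numeric (assumed to be a float).
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(LITE_COLUMNS):
        raise ValueError(
            f"expected {len(LITE_COLUMNS)} columns, got {len(fields)}"
        )
    record: Dict[str, Union[str, float]] = dict(zip(LITE_COLUMNS, fields))
    record["score"] = float(fields[4])
    return record
```

Running this over every line of a corpus before invoking the uploader would surface malformed rows early, with the offending column count in the error message.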