- Install JDK 7 or higher
- Install the latest version of the JRE
- Install Eclipse
- Install the latest version of Maven
- Open a terminal and type: sudo apt-get install maven
- Download and install solr-6.6.0 or higher from its official site
- Download the genomic data repository from the TREC 2007 Genomics Track data
- Start the Solr service. By default it uses port 8983. To check that it is running, open localhost:8983 in your browser
- Create a new core or collection. The default core/collection directory is /var/solr/data
- Once you have downloaded the data repository, extract it and combine all the files under one directory. This requires about 9.8 GB of space
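The combining step can be scripted. The following is a minimal sketch, not part of the project: the function name combine_files and both directory arguments are hypothetical. It moves files instead of copying them so the 9.8 GB is not duplicated, and it assumes the target directory lies outside the extracted tree.

```python
import shutil
from pathlib import Path


def combine_files(extracted_root, combined_dir):
    """Move every file found under extracted_root into combined_dir.

    Returns the number of files moved. Raises FileExistsError when two
    files would end up with the same name in combined_dir.
    """
    extracted_root = Path(extracted_root)
    combined_dir = Path(combined_dir)
    combined_dir.mkdir(parents=True, exist_ok=True)

    # Collect the file list first so the walk is not affected by the moves.
    files = sorted(p for p in extracted_root.rglob("*") if p.is_file())
    moved = 0
    for path in files:
        destination = combined_dir / path.name
        if destination.exists():
            raise FileExistsError(f"duplicate file name: {path.name}")
        shutil.move(str(path), str(destination))
        moved += 1
    return moved
```

For example, combine_files("extracted", "combined") would leave every extracted file directly under combined/.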
- Now index the data into the core/collection you created.
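This guide does not name the indexing command. One possible approach, sketched here as an assumption, is Solr's bundled bin/post tool, called as bin/post -c <core> <directory>. The install path /opt/solr, the core name and the data directory in the usage note are placeholders, and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_post_command(solr_install_dir, core, data_dir):
    """Return the argument list for Solr's bin/post tool."""
    post_tool = Path(solr_install_dir) / "bin" / "post"
    return [str(post_tool), "-c", core, str(data_dir)]


def index_directory(solr_install_dir, core, data_dir):
    """Run bin/post; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_post_command(solr_install_dir, core, data_dir), check=True)
```

For example, index_directory("/opt/solr", "<core/collection name>", "combined") would post every file in combined/ to that core, provided Solr is installed under /opt/solr.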
- Solr indexes your data according to the default configuration, but you can define and specify your own fields
You can update the solrconfig.xml and managed-schema files, located in the directory of your newly created core/collection, by adding new fields
- Open /var/solr/data/<core/collection name>/conf/ --> managed-schema, solrconfig.xml
- In the solrconfig.xml file, search for:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler"/>
- Add a new entry inside the above request handler element:
<str name="capture">body</str>
- To use your own field, replace body with your own field name
- Save and exit the file
- In the managed-schema file, search for:
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false" />
- Add a new tag under the "_text_" field:
<field name="body" type="text_general" indexed="true" stored="true"/>
- To use your own field, replace "body" with your own field name
- Restart Solr with the command:
sudo service solr restart
Note: The field name must be the same in the managed-schema and solrconfig.xml files
- If Solr reports an error, check your configuration files carefully
- Open /var/solr/data/<core/collection name>/conf/ --> managed-schema, solrconfig.xml
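The note about matching names can be checked before restarting Solr. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes both files are plain XML and that the capture entry sits somewhere inside the /update/extract request handler.

```python
import xml.etree.ElementTree as ET


def captured_fields(solrconfig_xml):
    """Return the capture values set on the /update/extract handler."""
    root = ET.fromstring(solrconfig_xml)
    names = set()
    for handler in root.iter("requestHandler"):
        if handler.get("name") != "/update/extract":
            continue
        for entry in handler.iter("str"):
            if entry.get("name") == "capture" and entry.text:
                names.add(entry.text.strip())
    return names


def schema_fields(schema_xml):
    """Return the names of all <field> elements in managed-schema."""
    root = ET.fromstring(schema_xml)
    return {f.get("name") for f in root.iter("field") if f.get("name")}


def missing_fields(solrconfig_xml, schema_xml):
    """Return captured field names that managed-schema does not define."""
    return sorted(captured_fields(solrconfig_xml) - schema_fields(schema_xml))
```

Read both files from the conf directory as text and pass their contents in. An empty result means every captured name has a matching field.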
- Open a terminal in your project root directory and type:
mvn compile
This downloads the dependencies declared in your pom.xml file and compiles the project
- For the code to run properly, the following files must be downloaded and extracted to their proper places
The following files must be included in your resources dir:
- Download trecgen2007.gold.standard.tsv.txt
- Download 2007topics.txt
- Download WordNet-3.0. Create a new directory named data in the main project directory and extract the contents of WordNet-3.0 into it.
- Add a folder named script under the resources dir
- Download trecgen2007_score.py and save it under the script dir
- Download the Semantic Types Mappings and Semantic Group files. Create a dir named Mappings under the resources dir and put both files in it.
- Now create a dir named DocResult under the resources dir --> (This directory will be used for the output of the results comparison)
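The directory layout above can be created and checked with a short script. This is a sketch under assumptions: the function name prepare_layout is hypothetical, and the location of the resources dir (for a Maven project, typically src/main/resources) must be supplied by you. The two Mappings files are not checked because their file names are not given here.

```python
from pathlib import Path

# Directories and files named in the steps above, relative to the resources dir.
REQUIRED_DIRS = ("script", "Mappings", "DocResult")
REQUIRED_FILES = (
    "trecgen2007.gold.standard.tsv.txt",
    "2007topics.txt",
    "script/trecgen2007_score.py",
)


def prepare_layout(project_root, resources_dir):
    """Create the expected directories and return the files still missing."""
    project_root = Path(project_root)
    resources = Path(resources_dir)

    # The data dir holds the extracted WordNet-3.0 contents.
    (project_root / "data").mkdir(parents=True, exist_ok=True)
    for name in REQUIRED_DIRS:
        (resources / name).mkdir(parents=True, exist_ok=True)

    return [name for name in REQUIRED_FILES if not (resources / name).is_file()]
```

Calling prepare_layout(".", "src/main/resources") from the project root returns the list of downloads that still need to be put in place.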