bioschemas-nutch-indexer

An Apache Nutch plugin that extracts and indexes Bioschemas data from websites.

Installation

Downloading and Running ElasticSearch 2.3.3

Download Elasticsearch 2.3.3 from here.
Extract elasticsearch-2.3.3.zip.
Start Elasticsearch from the bin directory of the extracted folder:

./elasticsearch

Apache Nutch Install and Setup

Download the binary package (apache-nutch-1.13-bin.zip) of Apache Nutch 1.13 from here.
Extract apache-nutch-1.13-bin.zip to your favorite location (from now on NUTCH_HOME)
Change the http.agent.name property in the NUTCH_HOME/conf/nutch-site.xml file. It should look something like this:

<property>
 <name>http.agent.name</name>
 <value>My Bioschemas Spider</value>
</property>

Change the elastic.host and elastic.port properties in the NUTCH_HOME/conf/nutch-site.xml file. They should look something like this:

<property>
  <name>elastic.host</name>
  <value>127.0.0.1</value>
  <description>Comma-separated list of hostnames to send documents to using
  TransportClient. Either host and port must be defined or cluster.</description>
</property>

<property>
  <name>elastic.port</name>
  <value>9300</value>
  <description>The port to connect to using TransportClient.</description>
</property>
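A quick way to confirm that the properties above ended up in nutch-site.xml is to parse the file and look them up. The following is a minimal sketch in Python, not part of the plugin: the function names and the inline sample are hypothetical, and it assumes the usual layout of property elements inside a configuration root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the properties shown above. A real check
# would read the text of NUTCH_HOME/conf/nutch-site.xml instead.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Bioschemas Spider</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>127.0.0.1</value>
  </property>
  <property>
    <name>elastic.port</name>
    <value>9300</value>
  </property>
</configuration>
"""

# Properties this README asks you to set.
REQUIRED = ("http.agent.name", "elastic.host", "elastic.port")


def read_properties(xml_text):
    """Return a dict mapping each property name to its value."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_properties(props, required=REQUIRED):
    """List the required properties that are absent or empty."""
    return [name for name in required if not props.get(name)]


if __name__ == "__main__":
    props = read_properties(SAMPLE_SITE_XML)
    print("missing:", missing_properties(props))
```

To check a real installation, pass the contents of your own nutch-site.xml to read_properties instead of the sample string.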

Download and install the plugin

Download the binary distribution for the index-bioschemas plugin available at the releases page in this repo.
Copy the plugin folder to NUTCH_HOME/plugins
Copy the mimetypes.jar file from here to NUTCH_HOME/lib
Edit the plugin.includes property in the NUTCH_HOME/conf/nutch-site.xml file so it uses the index-bioschemas plugin:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|bioschemas)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
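Because plugin.includes is a regular expression, a typo in it can silently exclude the plugin. The sketch below checks which plugin directory names the value above would select. It assumes that the whole directory name must match the expression, and it uses Python's re module as a stand-in for Java's regex engine, which behaves the same way for a simple pattern like this one. The helper name is hypothetical.

```python
import re

# The plugin.includes value from the property above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika)"
    "|index-(basic|anchor|bioschemas)|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_name, pattern=PLUGIN_INCLUDES):
    """True if the whole plugin directory name matches the expression."""
    return re.fullmatch(pattern, plugin_name) is not None


if __name__ == "__main__":
    for name in ("index-bioschemas", "index-basic", "index-more"):
        print(name, is_included(name))
```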

Testing your installation

cd NUTCH_HOME
mkdir urls
cd urls
touch seed.txt
echo "https://tess.elixir-europe.org/events/uppmax-introductory-course-summer-2017" >> seed.txt
echo "https://tess.elixir-europe.org/events/the-genomics-era-the-future-of-genetics-in-medicine-6c93a777-54db-49f8-86f9-3be4bd91f87a" >> seed.txt
cd ..
bin/crawl -i urls tess-events/ 1

Then go to http://localhost:9200/nutch/_search?pretty=true&q=*:*&size=100

Each document body has several fields populated from the data crawled by Nutch, such as the plain-text content (content), the crawl timestamp (tstamp), the source URL (id), and the page title (title). The bioschemas field holds a JSON string containing the JSON representation of the microdata extraction result. This result is a JSON object in which each field is named after one item type found in the extracted microdata; in this example the item types are "BreadCrumbList" and "Event". Each of these fields holds a JSON array with the JSON object representations of the collected items.
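Since the bioschemas field is a JSON string rather than a nested object, it needs a second decoding step after the search response itself has been parsed. The sketch below illustrates this on a hand-written hit. The sample hit, its title, and the contents of the items are hypothetical placeholders; only the field names and the two item types come from the description above.

```python
import json

# Hypothetical search hit. The item contents are placeholders, not real
# extraction output; only the overall layout follows the text above.
SAMPLE_HIT = {
    "_source": {
        "id": "https://tess.elixir-europe.org/events/uppmax-introductory-course-summer-2017",
        "title": "Example title",
        "bioschemas": json.dumps(
            {
                "Event": [{"name": "Example event"}],
                "BreadCrumbList": [{"name": "Example breadcrumb"}],
            }
        ),
    }
}


def bioschemas_items(hit):
    """Decode the bioschemas JSON string of one hit into a dict of lists."""
    raw = hit.get("_source", {}).get("bioschemas")
    if not raw:
        return {}
    return json.loads(raw)


def count_items(items):
    """Count the collected items per item type."""
    return {item_type: len(entries) for item_type, entries in items.items()}


if __name__ == "__main__":
    items = bioschemas_items(SAMPLE_HIT)
    print(count_items(items))
```

With a real response, the same function would be applied to each entry of the hits list returned by the search URL above.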

Apache Nutch documentation

For more information about using Apache Nutch to crawl websites, see here.

Setting up the development environment

In order to build Apache Nutch with the plugin you will need Apache Ant.

Download the apache-nutch-1.13-src.tar.gz file from here
Extract the content to your favorite location (NUTCH_SRC_HOME)
Make sure you can build the Apache Nutch source out of the box:

cd NUTCH_SRC_HOME
ant

Now you will see a ready-to-use binary install of Nutch in NUTCH_SRC_HOME/runtime/local.
Clone this repo and copy the index-bioschemas folder into NUTCH_SRC_HOME/src/plugin.
Edit NUTCH_SRC_HOME/src/plugin/build.xml to add the plugin so Ant can deploy it, test it, and clean it. Insert the following at line 33:

<ant dir="index-bioschemas" target="deploy"/>

the following at line 104:

<ant dir="index-bioschemas" target="test"/>

and the following next to the other clean entries:

<ant dir="index-bioschemas" target="clean"/>

Edit NUTCH_SRC_HOME/build.xml to make the plugin packageset available to Ant. Insert the following at lines 181 and 627:

<packageset dir="${plugins.dir}/index-bioschemas/src/java"/>

Edit NUTCH_SRC_HOME/build.xml to make the plugin source path available to Ant. Insert the following at line 1031:

<source path="${plugins.dir}/index-bioschemas/src/java"/>

Edit NUTCH_SRC_HOME/default.properties to tell Apache Nutch to include the plugin in the build process. Insert the following at line 149:

org.apache.nutch.parse.bioschemas*:\

and the following at line 160:

org.apache.nutch.indexer.bioschemas*:\

A dependency required by Apache Any23 is missing from the Maven repositories, so you will have to add it manually to your local Ivy repository.

Download the missing jar from here, rename it to commons-csv.jar, and put it in HOME/.ivy2/local/org.apache.commons/commons-csv/1.0-SNAPSHOT-rev1148315/jars/

Go to NUTCH_SRC_HOME/ and run ant.
Now the NUTCH_SRC_HOME/runtime/local installation will include the plugin. To make it work correctly, copy mimetypes.jar to NUTCH_SRC_HOME/runtime/local/lib/
You can now test your installation. Execute steps 3 and 4 of the Apache Nutch Install and Setup section (setting http.agent.name, elastic.host, and elastic.port) and step 4 of the Download and install the plugin section (setting plugin.includes).
Run the commands in the Testing your installation section using NUTCH_SRC_HOME/runtime/local instead of NUTCH_HOME.

Remember to run ant from NUTCH_SRC_HOME every time you change the code.

[OPTIONAL] To generate the Eclipse project files so you can later import the project into the IDE, run the following from NUTCH_SRC_HOME:

ant eclipse

If you want to run Nutch from Eclipse you will have to follow this guide.
