An Apache Nutch plugin to extract and index bioschemas data from websites
- Download Elasticsearch (ES) 2.3.3 from here.
- Extract elasticsearch-2.3.3.zip and start Elasticsearch from the bin directory of the extracted folder:
./elasticsearch
- Download the binary package (apache-nutch-1.13-bin.zip) of Apache Nutch 1.13 from here.
- Extract apache-nutch-1.13-bin.zip to a location of your choice (referred to below as NUTCH_HOME)
- Change the http.agent.name property in the NUTCH_HOME/conf/nutch-site.xml file. It should look something like this:
<property>
<name>http.agent.name</name>
<value>My Bioschemas Spider</value>
</property>
- Change the elastic.host and elastic.port properties in the NUTCH_HOME/conf/nutch-site.xml file. It should look something like this:
<property>
<name>elastic.host</name>
<value>127.0.0.1</value>
<description>Comma-separated list of hostnames to send documents to using
TransportClient. Either host and port must be defined or cluster.</description>
</property>
<property>
<name>elastic.port</name>
<value>9300</value>
<description>The port to connect to using TransportClient.</description>
</property>
- Download the binary distribution for the index-bioschemas plugin available at the releases page in this repo.
- Copy the plugin folder to NUTCH_HOME/plugins
- Copy the mimetypes.jar file from here to NUTCH_HOME/lib
- Edit the plugin.includes property in the NUTCH_HOME/conf/nutch-site.xml file so it uses the index-bioschemas plugin:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|bioschemas)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
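The configuration steps above can be double-checked with a short script before crawling. The following is a minimal sketch, not part of the plugin: the helper names `read_properties` and `check_site_config` are hypothetical, and it assumes that a plugin is enabled when its whole directory name matches the `plugin.includes` expression (Python's `re` is used here as a stand-in for the Java regex engine Nutch uses, which behaves the same for a simple pattern like this one).

```python
import re
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_site_config(xml_text):
    """Return a list of problems found; an empty list means the checks passed."""
    props = read_properties(xml_text)
    problems = []
    # Properties this README asks you to set in nutch-site.xml.
    for required in ("http.agent.name", "elastic.host", "elastic.port"):
        if not props.get(required):
            problems.append(f"{required} is missing or empty")
    includes = props.get("plugin.includes", "")
    # Assumption: the full plugin directory name must match the expression.
    if not re.fullmatch(includes, "index-bioschemas"):
        problems.append("plugin.includes does not match index-bioschemas")
    return problems
```

To use it, read NUTCH_HOME/conf/nutch-site.xml into a string and pass it to `check_site_config`; any returned message points at the property that still needs editing.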
To test your installation, create a seed list and run one crawl round from NUTCH_HOME:
cd NUTCH_HOME
mkdir urls
cd urls
touch seed.txt
echo "https://tess.elixir-europe.org/events/uppmax-introductory-course-summer-2017" >> seed.txt
echo "https://tess.elixir-europe.org/events/the-genomics-era-the-future-of-genetics-in-medicine-6c93a777-54db-49f8-86f9-3be4bd91f87a" >> seed.txt
cd ..
bin/crawl -i urls tess-events/ 1
Then open http://localhost:9200/nutch/_search?pretty=true&q=*:*&size=100 to see the indexed documents.
Each document body has several fields derived from the data crawled by Nutch, such as the plain-text content (content), the crawl timestamp (tstamp), the source URL (id) and the page title (title), among others. The 'bioschemas' field holds a JSON string with the JSON representation of the microdata extraction result. This result is a JSON object in which each field is named after one item type found in the extracted microdata; in this example, "BreadCrumbList" and "Event". Each of those fields holds a JSON array with the JSON object representation of the collected items.
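Because the 'bioschemas' field is a JSON string nested inside the search result, it needs a second decoding step before the items can be used. The sketch below shows one way to do that. The function name `bioschemas_items` is hypothetical, and the code assumes the standard Elasticsearch search response layout (`hits.hits[]._source`) together with the `id` and `bioschemas` fields described above.

```python
import json


def bioschemas_items(search_response):
    """Yield (document id, item type, item) for every extracted item."""
    for hit in search_response["hits"]["hits"]:
        source = hit["_source"]
        raw = source.get("bioschemas")
        if not raw:
            # Document was indexed without any extracted microdata.
            continue
        # The field holds a JSON string, so it must be decoded separately.
        extracted = json.loads(raw)
        for item_type, items in extracted.items():
            for item in items:
                yield source.get("id"), item_type, item
```

With the search response already parsed into a dictionary (for example with `json.load`), iterating over `bioschemas_items(response)` gives one tuple per collected item, which makes it easy to filter by item type such as "Event".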
Please find more information about how to use Apache Nutch in order to crawl websites here
In order to build Apache Nutch with the plugin you will need Apache Ant.
- Download the apache-nutch-1.13-src.tar.gz file from here
- Extract the content to your favorite location (NUTCH_SRC_HOME)
- Make sure you can build the Apache Nutch source out of the box:
cd NUTCH_SRC_HOME
ant
Now you will see a ready-to-use binary install for Nutch in NUTCH_SRC_HOME/runtime/local
- Clone this repo and copy the index-bioschemas folder into NUTCH_SRC_HOME/src/plugin
- Edit NUTCH_SRC_HOME/src/plugin/build.xml to add the plugin so Ant can deploy, test and clean it. Insert:
<ant dir="index-bioschemas" target="deploy"/>
in line 33, then:
<ant dir="index-bioschemas" target="test"/>
in line 104, and:
<ant dir="index-bioschemas" target="clean"/>
next to the other plugins' clean entries.
- Edit NUTCH_SRC_HOME/build.xml to make the plugin packageset available to Ant.
Insert:
<packageset dir="${plugins.dir}/index-bioschemas/src/java"/>
In lines 181 and 627.
- Edit NUTCH_SRC_HOME/build.xml to make the plugin source path available to Ant. Insert:
<source path="${plugins.dir}/index-bioschemas/src/java"/>
In line 1031.
- Edit the NUTCH_SRC_HOME/default.properties to tell Apache Nutch that we want our plugin included in the build process. Insert:
org.apache.nutch.parse.bioschemas*:\
In line 149, and insert:
org.apache.nutch.indexer.bioschemas*:\
In line 160.
- A dependency of Apache Any23 is missing from the Maven repositories, so you will have to add it manually to your local Ivy repository.
Download the missing jar from here. Change its name to commons-csv.jar and put it in HOME/.ivy2/local/org.apache.commons/commons-csv/1.0-SNAPSHOT-rev1148315/jars/
- Go to NUTCH_SRC_HOME/ and run ant.
- Now the NUTCH_SRC_HOME/runtime/local installation will include the plugin. To make it work correctly, copy mimetypes.jar to NUTCH_SRC_HOME/runtime/local/lib/
- You can now test your installation. Execute steps 3 and 4 of the Apache Nutch Install and Setup section and step 4 of the plugin install section.
- Run the commands in the Test your installation section using NUTCH_SRC_HOME/runtime/local instead of NUTCH_HOME.
Remember to run ant from the NUTCH_SRC_HOME every time you make a change in the code.
- [OPTIONAL] From NUTCH_SRC_HOME run:
ant eclipse
This generates the Eclipse project files, which you can then import into the IDE. If you want to run Nutch from Eclipse, you will have to follow this guide.
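The manual Ivy step described earlier (renaming the downloaded jar and placing it in the local Ivy repository) is easy to get wrong by a single path segment, so it may help to script it. This is only a sketch: the function name `install_commons_csv` and its parameters are hypothetical, and the target path is taken directly from the instructions above, with HOME resolved to the current user's home directory unless another one is passed in.

```python
import shutil
from pathlib import Path


def install_commons_csv(downloaded_jar, home=None):
    """Copy the downloaded jar into the local Ivy repo as commons-csv.jar."""
    home = Path(home) if home is not None else Path.home()
    # Directory layout as given in the build instructions of this README.
    target_dir = (home / ".ivy2" / "local" / "org.apache.commons"
                  / "commons-csv" / "1.0-SNAPSHOT-rev1148315" / "jars")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "commons-csv.jar"
    # Copy rather than move, so the original download stays in place.
    shutil.copyfile(downloaded_jar, target)
    return target
```

Call it with the path of the jar you downloaded; the returned path shows where the file ended up, after which ant can be run again from NUTCH_SRC_HOME.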