You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)
while True:
print("---------------- Printing JOb Progress ------- : ", cc.progress)
job = cc.progress() # gets the current job if no progress, else iterates and makes progress
print("---------------- Printing JOb Progess ------- : ", job)
if job == None:
break
---------------- Printing JOb Progess ------- : <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing JOb Progress ------- : <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'FAILED', 'msg': 'ERROR: java.io.IOException: Job failed!', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : FAILED
Traceback (most recent call last):
File "test.py", line 18, in
job = cc.progress() # gets the current job if no progress, else iterates and makes progress
File "/home/purushottam/Documents/tech_learn/ex_nutch/nutch-python/nutch/nutch.py", line 563, in progress
raise NutchCrawlException
nutch.nutch.NutchCrawlException
`
**Nutch Hadoop Log file : **
`2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: test/crawldb
2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: test/segments/20180719002543
2018-07-19 13:04:16,528 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-07-19 13:04:16,726 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: content dest: content
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: title dest: title
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: host dest: host
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: segment dest: segment
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: boost dest: boost
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: digest dest: digest
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-19 13:04:16,738 WARN mapred.LocalJobRunner - job_local1889671569_0083
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.
<title>Error 404 Not Found</title>
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Not Found
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.
<title>Error 404 Not Found</title>
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Not Found
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:544)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-07-19 13:04:17,636 ERROR impl.JobWorker - Cannot run job worker!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:96)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:89)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:351)
at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)`
The text was updated successfully, but these errors were encountered:
I'm trying to run this script but getting error whle indexing the data ( Job Status : Failed).
https://github.com/chrismattmann/nutch-python/wiki
Script:
from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch
sv=Server('http://localhost:8081')
sc=SeedClient(sv)
seed_urls=('http://espn.go.com','http://www.espn.com')
sd= sc.create('espn-seed',seed_urls)
nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)
while True:
print("---------------- Printing JOb Progress ------- : ", cc.progress)
job = cc.progress() # gets the current job if no progress, else iterates and makes progress
print("---------------- Printing JOb Progess ------- : ", job)
if job == None:
break
End Output :
`---------------- Printing JOb Progess ------- : <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing JOb Progress ------- : <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'RUNNING', 'msg': 'OK', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : RUNNING
---------------- Printing JOb Progess ------- : <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing JOb Progress ------- : <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'FAILED', 'msg': 'ERROR: java.io.IOException: Job failed!', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : FAILED
Traceback (most recent call last):
File "test.py", line 18, in
job = cc.progress() # gets the current job if no progress, else iterates and makes progress
File "/home/purushottam/Documents/tech_learn/ex_nutch/nutch-python/nutch/nutch.py", line 563, in progress
raise NutchCrawlException
nutch.nutch.NutchCrawlException
`
**Nutch Hadoop Log file : **
`2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: test/crawldb
<title>Error 404 Not Found</title>2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: test/segments/20180719002543
2018-07-19 13:04:16,528 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-07-19 13:04:16,726 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: content dest: content
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: title dest: title
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: host dest: host
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: segment dest: segment
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: boost dest: boost
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: digest dest: digest
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-19 13:04:16,738 WARN mapred.LocalJobRunner - job_local1889671569_0083
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.
<title>Error 404 Not Found</title>HTTP ERROR 404
Problem accessing /solr/update. Reason:
2018-07-19 13:04:17,636 ERROR impl.JobWorker - Cannot run job worker!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:96)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:89)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:351)
at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)`
The text was updated successfully, but these errors were encountered: