Question

While following the Nutch tutorial at http://wiki.apache.org/nutch/NutchTutorial I get the error below and can't figure out what causes it:

runtime/local$ bin/nutch parse $s1
ParseSegment: starting at 2013-10-11 17:43:36
ParseSegment: segment: crawl/segments/20131011173126
Exception in thread "main" java.io.IOException: Segment already parsed!
    at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:247)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:220)

Solution

This happens when you try to parse a segment that has already been parsed. Note that the "crawl" command also parses the segment, so running "parse" on a segment it produced triggers this error.

If you really want to parse it again, remove the crawl_parse directory inside your segment (here, crawl/segments/20131011173126/crawl_parse) and run the parse command again.
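
As a rough sketch of that step, the shell function below removes only the crawl_parse directory of a given segment and leaves the rest of the segment alone. The name clear_parse_output is made up for illustration and is not part of Nutch; it assumes the segment lives on the local filesystem, as in the runtime/local setup from the question.

```shell
# Hypothetical helper (not a Nutch command): remove the crawl_parse
# directory whose presence causes "Segment already parsed!".
# Argument: path to the segment, e.g. crawl/segments/20131011173126
clear_parse_output() {
  segment="$1"
  # Refuse to run without a segment path, so we never touch /crawl_parse.
  if [ -z "$segment" ]; then
    echo "usage: clear_parse_output <segment-dir>" >&2
    return 1
  fi
  if [ -d "$segment/crawl_parse" ]; then
    rm -r "$segment/crawl_parse"
    echo "removed $segment/crawl_parse"
  else
    echo "no crawl_parse in $segment, nothing to remove"
  fi
}
```

From runtime/local you would then call clear_parse_output "$s1" and afterwards re-run bin/nutch parse "$s1", where $s1 is the segment variable set earlier in the tutorial.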

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow