Domanda

while following this link i'm getting this error but can't figure out it http://wiki.apache.org/nutch/NutchTutorial

runtime/local$ bin/nutch parse $s1 ParseSegment: starting at 2013-10-11 17:43:36 ParseSegment: segment: crawl/segments/20131011173126 Exception in thread "main" java.io.IOException: Segment already parsed! at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353) at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213) at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:247) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:220)

È stato utile?

Soluzione

This will happen when you want to parse an already parsed segment. Note that if you use the "crawl" command it also parses the segment.

If you really want to parse again, just remove the crawl_parse directory inside your segment (i.e. crawl/segments/20131011173126/crawl_parse) and issue the parse command again.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top