Question

I am trying to integrate Apache Nutch 2.2.1 with Elasticsearch 0.90.11.

I have followed all the tutorials I could find (there are not many) and even changed bin/crawl.sh to index with Elasticsearch instead of Solr. Everything seems to work when I run the script, until Elasticsearch tries to index the crawled data.

I checked hadoop.log in the logs folder under Nutch and found the following errors:

  1. Error injecting constructor, java.lang.NoSuchFieldError: STOP_WORDS_SET
  2. Error injecting constructor, java.lang.NoClassDefFoundError: Could not initialize class org.apache.lucene.analysis.en.EnglishAnalyzer$DefaultSetHolder

If you managed to get it working I would very much appreciate the help.

Thanks, Andrei.


Solution

I have never used Apache Nutch, but after briefly reading about it, I suspect that adding Elasticsearch has caused a classpath collision with a different version of Lucene that is also on the classpath. Since Nutch's Maven POM does not specify Lucene directly, I would suggest including only the Lucene jars bundled with Elasticsearch, which should be Apache Lucene 4.6.1 for your version (0.90.11).
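As a quick check, you can look for multiple versions of the same artifact in the lib directory Nutch runs from. A minimal sketch, assuming a default Nutch 2.x runtime/local/lib layout (the path is an assumption; point it at your actual install):

```shell
# List artifacts that appear more than once under different versions,
# e.g. lucene-core-4.6.1.jar sitting next to lucene-core-4.7.0.jar.
# LIB_DIR is an assumption for a default Nutch 2.x local runtime.
LIB_DIR="runtime/local/lib"
ls "$LIB_DIR"/*.jar 2>/dev/null \
  | sed 's#.*/##; s/-[0-9][0-9.]*\.jar$//' \
  | sort | uniq -d
```

Any artifact name printed here is present in two or more versions and is a candidate for the collision.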

Duplicated code (differing versions of the same jar) tends to be the cause of NoClassDefFoundError when you are certain that the necessary code is present. Since you switched from Solr to Elasticsearch, it is likely that the Solr jars were left on your classpath, which would cause the collision at hand. The current release of Solr is 4.7.0, which matches the Lucene version it ships with, and Lucene 4.7.0 would collide with 4.6.1.
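To confirm which jars actually contain the class from your stack trace, you can grep the jars directly: zip entry names are stored uncompressed inside the archive, so a plain `grep -l` is enough to locate candidates. A sketch, again assuming the runtime/local/lib path from a default Nutch 2.x layout:

```shell
# Find every jar that contains the class named in the
# NoClassDefFoundError; two or more hits means a version collision.
# The lib path is an assumption; adjust it to your install.
CLASS='org/apache/lucene/analysis/en/EnglishAnalyzer'
grep -l "$CLASS" runtime/local/lib/*.jar 2>/dev/null
```

If more than one jar is listed, remove the ones that do not match the Lucene 4.6.1 bundled with your Elasticsearch.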

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow