Question

What is the best way to integrate Apache Tika, assuming that I have already connected and used Nutch (2.2.1) + Solr (4.3)?

I understand that Tika can be integrated within Nutch and/or Solr, but which one is the better choice?

Solution

Set up the Tika plugin in Nutch. Nutch will then parse the data and do all the hard work for you.
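As a rough illustration, Nutch decides which plugins to load from the `plugin.includes` property in `nutch-site.xml`, whose value is a regular expression over plugin IDs. The sketch below checks whether such a value would enable `parse-tika`. The XML content and the helper name `plugin_enabled` are assumptions made for this example, and the full-match behaviour is my understanding of how Nutch applies the pattern, so verify it against your own installation.

```python
import re
import xml.etree.ElementTree as ET

# Assumed example of a nutch-site.xml; adjust the plugin list to your setup.
NUTCH_SITE_XML = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
</configuration>
"""


def plugin_enabled(site_xml: str, plugin_id: str) -> bool:
    """Return True if plugin.includes (a regex) matches the whole plugin ID.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            pattern = prop.findtext("value", default="")
            return re.fullmatch(pattern, plugin_id) is not None
    # Property not present in this file.
    return False


if __name__ == "__main__":
    print(plugin_enabled(NUTCH_SITE_XML, "parse-tika"))
```

If the check fails for `parse-tika`, adding `tika` to the `parse-(...)` group of the value is the usual place to look.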

I would suggest setting it up on Solr as well. You may wish to send documents to Solr directly with the curl command, and it helps to have Tika set up on Solr for that too. It takes little extra configuration and has no performance cost:

There is a guide to setting up Tika and the extracting request handler here
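To make the curl route concrete, here is a minimal sketch that assembles the kind of request Solr's extracting request handler accepts at `/update/extract`. It only builds the command and does not run it. The base URL, document ID, file name, and the helper name `build_extract_request` are assumptions for this example, and the handler must already be configured in your `solrconfig.xml`.

```python
from urllib.parse import urlencode


def build_extract_request(solr_base, doc_id, file_path, commit=True):
    """Build a curl argument list for posting a file to /update/extract.

    Hypothetical helper for illustration; all argument values are assumed.
    """
    # literal.id sets the unique key of the resulting Solr document.
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    url = f"{solr_base.rstrip('/')}/update/extract?{urlencode(params)}"
    # -F uploads the file as multipart form data; "myfile" is an arbitrary field name.
    return ["curl", url, "-F", f"myfile=@{file_path}"]


if __name__ == "__main__":
    cmd = build_extract_request("http://localhost:8983/solr", "doc1", "report.pdf")
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` once the Solr instance is reachable, or printed and pasted into a shell (quote the URL there, since it contains `&`).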

Other tips

Apply the Tika parser in Nutch's parsing phase.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow