Question

I'm running Apache Nutch, which seems to work and in small runs will index documents and commit to Solr at the end of the run.

Unfortunately, I want to index deep within some large sites, and Nutch won't commit until the end of a run.

This has obvious issues: 100k+ documents stack up waiting to commit, putting pressure on memory, and the data isn't available until the run finishes.

Is there a way to persuade Nutch to commit more frequently?


Solution

There is a configuration parameter in Nutch named "solr.commit.size" which, according to the description in nutch-default.xml, is:

Defines the number of documents to send to Solr in a single update batch. Decrease when handling very large documents to prevent Nutch from running out of memory. NOTE: It does not explicitly trigger a server side commit.
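For example, assuming a standard Nutch installation, you could override this parameter in conf/nutch-site.xml to send smaller batches (the value below is illustrative, not a recommendation):

    <property>
      <name>solr.commit.size</name>
      <value>250</value>
      <description>Number of documents to buffer before sending an update batch to Solr.</description>
    </property>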

As the description says, it does not explicitly commit, because it is better to leave the decision about commit timing to Solr. So you should also tune your Solr configuration parameters autoCommit and autoSoftCommit; you can find their descriptions in the solrconfig.xml file. A sketch of those settings follows.
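Here is a minimal sketch of what those settings might look like inside the updateHandler section of solrconfig.xml, assuming a standard Solr install (times are in milliseconds, and the values are illustrative):

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Hard commit: flush to stable storage every 60s or every 10000 docs -->
      <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>60000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <!-- Soft commit: make newly indexed documents visible to searchers every 1s -->
      <autoSoftCommit>
        <maxTime>1000</maxTime>
      </autoSoftCommit>
    </updateHandler>

With openSearcher set to false, hard commits only flush data to disk without reopening a searcher; the cheaper soft commit controls when documents become visible to queries, so the two can be tuned independently.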
