Question

I am trying to index documents to Solr with SolrJ. I am using Solr 4.5 and have huge files to index. What are the ways to index each file while avoiding performance bottlenecks?


Solution

The first thing to check is the server-side log: look for messages about commits. It is possible you are doing a hard commit after parsing each file, which is expensive. You could look into soft commits or the commitWithin parameter to have files show up slightly later.
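As a minimal sketch of the commitWithin approach with SolrJ 4.x (the Solr URL and field names below are placeholders, not from the question):

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at your own Solr core.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1"); // field name is illustrative

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        // Ask Solr to make the document searchable within 10 seconds,
        // instead of issuing an explicit (expensive) hard commit per file.
        req.setCommitWithin(10000);
        req.process(server);
    }
}
```

This lets Solr batch the actual commit internally, so indexing many files does not pay the hard-commit cost each time.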

Secondly, you seem to be sending a request to Solr to fetch your file and run Tika extraction on it server-side. This probably restarts Tika inside Solr every time, and you will not be able to batch those requests as the other answers suggest.

But you could run Tika locally in your client, initialize it once, and keep it around. That gives you more flexibility in how you construct your SolrInputDocument, which you can then batch.
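A minimal sketch of that idea, assuming Tika and SolrJ are on the classpath; the class and field names are illustrative, not part of any Solr API:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class LocalTikaIndexer {
    // One parser instance, initialized once and reused for every file,
    // instead of letting Solr spin Tika up per extract request.
    private final AutoDetectParser parser = new AutoDetectParser();

    public SolrInputDocument toDocument(Path file) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no size limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, metadata);
        }
        SolrInputDocument doc = new SolrInputDocument();
        // Field names here are illustrative; match them to your schema.
        doc.addField("id", file.toString());
        doc.addField("content", handler.toString());
        doc.addField("content_type", metadata.get(Metadata.CONTENT_TYPE));
        return doc;
    }
}
```

The resulting SolrInputDocuments can be collected into a list and sent to Solr in batches, as shown in the answer below.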

Other tips

Sending an update request for each individual document is slow with Solr.

You are much better off adding all the documents and then committing once in the same update request. Taken from the Solr wiki:

import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
docs.add(doc2);

// Batch all documents into a single request and commit once at the end.
UpdateRequest req = new UpdateRequest();
req.setAction(UpdateRequest.ACTION.COMMIT, false, false);
req.add(docs);
UpdateResponse rsp = req.process(server); // server is your SolrServer instance
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow