Question

I am trying to index documents with SolrJ. I am using Solr 4.5 and have huge files to be indexed. What are the ways to index each file so as to avoid a performance bottleneck?

Solution

The first thing to check is the server-side log: look for messages about commits. It's possible you are doing a hard commit after parsing each file. That's expensive. You could look into soft commits or the commitWithin parameter to have files show up slightly later.
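
For illustration, a minimal SolrJ sketch of commitWithin (assuming `server` is an already-created HttpSolrServer for your core; the field names are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

// Ask Solr to make the document visible within 10 seconds, so pending
// updates get batched into one commit instead of one commit per file.
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "file-1");
doc.addField("content", "...text extracted from the file...");
server.add(doc, 10000); // commitWithin = 10000 ms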

Secondly, you seem to be sending a request to Solr that makes it fetch your file and run Tika extraction on it. This probably restarts Tika inside Solr every time, so you will not be able to batch it as other answers seem to suggest.
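
That per-file call typically looks something like the following sketch (assuming the stock /update/extract handler is enabled; the file name and literal.id value are illustrative):

import java.io.File;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

// One HTTP round trip and one Tika extraction inside Solr per file --
// this is the part that cannot be batched.
ContentStreamUpdateRequest extract = new ContentStreamUpdateRequest("/update/extract");
extract.addFile(new File("huge-file.pdf"), "application/pdf");
extract.setParam("literal.id", "huge-file.pdf");
server.request(extract);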

But you could run Tika locally in your client, initialize it once, and keep it around. That gives you more flexibility in how you construct your SolrInputDocument, which you can then batch.
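
A sketch of that client-side approach (assuming the Tika jars are on the client classpath; `filesToIndex`, the field names, and `server` are placeholders):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Initialize Tika once and reuse it for every file.
AutoDetectParser parser = new AutoDetectParser();
Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

for (File file : filesToIndex) {
    BodyContentHandler text = new BodyContentHandler(-1); // -1 = no write limit
    Metadata metadata = new Metadata();
    InputStream in = new FileInputStream(file);
    try {
        parser.parse(in, text, metadata, new ParseContext());
    } finally {
        in.close();
    }
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", file.getName());
    doc.addField("content", text.toString());
    batch.add(doc);
}
server.add(batch, 10000); // one batched add, commitWithin 10 s

Parsing on the client keeps Solr's indexing threads free and lets you size the batches yourself.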

Other tips

Updating Solr one document at a time is slow.

You are much better off adding all the documents and then doing a single commit as part of the update request. Taken from the Solr wiki:

import java.util.ArrayList;
import java.util.Collection;

import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add( doc1 );
docs.add( doc2 );

// Commit once as part of this request, not once per document
UpdateRequest req = new UpdateRequest();
req.setAction( UpdateRequest.ACTION.COMMIT, false, false );
req.add( docs );
UpdateResponse rsp = req.process( server );