Question

We have set up a Solr index containing 36 million documents (~1K-2K each), and we query for a maximum of 100 documents matching a single simple keyword. This works as fast as we had hoped. However, if we add "&sort=createDate+desc" to the query (thus asking for the top 100 'newest' documents matching the query), it runs for a very, very long time and finally results in an OutOfMemoryError. As I understand it from the manual, this is caused by the fact that Lucene needs to load all the distinct values for this field (createDate) into memory (the FieldCache, AFAIK) before it can execute the query. Since the createDate field contains both date and time, the number of distinct values is very large. It is also important to mention that we frequently update the index.

Perhaps someone can provide some insight and direction on how we can tune Lucene/Solr, or change our approach, so that query times become acceptable? Your input will be much appreciated. Thanks!

Solution

The problem is that Lucene stores numbers as strings. There are utilities that split a date into YYYY, MM, and DD and put them in separate fields; that gives much better results.
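
A minimal sketch of that splitting approach, assuming the Lucene 2.x Document API; the field names (createDate_yyyy and so on) and the helper class are illustrative, not a standard utility:

import java.util.Calendar;
import java.util.Date;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DateFieldSplitter {

    // Index a date as three zero-padded keyword fields. Each field has few
    // distinct values (years, 12 months, 31 days), so sorting on them is
    // far cheaper for the FieldCache than one field with millions of values.
    public static void addDateFields(Document doc, Date createDate) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(createDate);
        doc.add(new Field("createDate_yyyy",
                String.format("%04d", cal.get(Calendar.YEAR)),
                Field.Store.NO, Field.Index.NOT_ANALYZED));
        doc.add(new Field("createDate_mm",
                String.format("%02d", cal.get(Calendar.MONTH) + 1),
                Field.Store.NO, Field.Index.NOT_ANALYZED));
        doc.add(new Field("createDate_dd",
                String.format("%02d", cal.get(Calendar.DAY_OF_MONTH)),
                Field.Store.NO, Field.Index.NOT_ANALYZED));
    }
}

Sorting newest-first then uses a multi-field sort over the three fields, e.g. new Sort(new SortField[] { new SortField("createDate_yyyy", SortField.STRING, true), new SortField("createDate_mm", SortField.STRING, true), new SortField("createDate_dd", SortField.STRING, true) }).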

Newer versions of Lucene (2.9 onwards) support numeric fields, and the performance improvements are significant (a couple of orders of magnitude, IIRC). Check this article about numeric queries.
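
For example (a sketch against the Lucene 2.9 API; in Solr the equivalent is a trie-based field type such as tdate):

import java.util.Date;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class NumericDateExample {

    // Index createDate as a trie-encoded long (epoch milliseconds)
    // instead of a string field.
    public static Document buildDocument(Date createDate) {
        Document doc = new Document();
        doc.add(new NumericField("createDate").setLongValue(createDate.getTime()));
        return doc;
    }

    // Sort numerically, newest first; the numeric FieldCache entry is a
    // plain long[] rather than millions of distinct strings.
    public static Sort newestFirst() {
        return new Sort(new SortField("createDate", SortField.LONG, true));
    }
}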

OTHER TIPS

You can sort the results by index order instead. Document numbers follow insertion order, so if documents are indexed chronologically, descending document order approximates 'newest first' without loading any field values into memory. The sort specification for descending by document number is:

new SortField(null, SortField.DOC, true)
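
A usage sketch, where searcher and query stand in for your existing IndexSearcher and keyword query:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

public class IndexOrderSearch {

    // Fetch the 100 most recently added matches by walking document
    // numbers in reverse; no FieldCache entry is built for any field.
    public static TopDocs newestMatches(IndexSearcher searcher, Query query)
            throws IOException {
        Sort byDocDesc = new Sort(new SortField(null, SortField.DOC, true));
        return searcher.search(query, null, 100, byDocDesc);
    }
}

One caveat, since you mention frequent updates: an update in Lucene is a delete plus a re-add, so updated documents move to the end of the index and would surface as 'newest' under this sort.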

You should also partition the index directories by the date field. Lucene examines every matching document when collecting the top N results, and partitioning splits that examined set: if the newest partition already yields N results, the older partitions do not need to be searched at all. A sketch of this follows below.
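
A minimal sketch of that idea, assuming one IndexSearcher per date partition, ordered newest first (the partitioning scheme and names are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class PartitionedSearch {

    // Search date partitions from newest to oldest and stop as soon as
    // n hits have been collected; older partitions are never touched.
    // Note: each ScoreDoc's doc id is local to its own partition.
    public static List<ScoreDoc> topN(IndexSearcher[] partitionsNewestFirst,
            Query query, int n) throws IOException {
        List<ScoreDoc> hits = new ArrayList<ScoreDoc>();
        for (IndexSearcher partition : partitionsNewestFirst) {
            TopDocs top = partition.search(query, n - hits.size());
            for (ScoreDoc hit : top.scoreDocs) {
                hits.add(hit);
            }
            if (hits.size() >= n) {
                break;
            }
        }
        return hits;
    }
}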

Try converting your Date data into a string representation (such as milliseconds since the epoch).
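
One way to read this tip (my interpretation, not spelled out in the original answer): encode the timestamp as a zero-padded milliseconds-since-epoch string, so that lexicographic order matches chronological order:

import java.util.Date;

public class SortableDateString {

    // Zero-pad to a fixed width so string comparison equals numeric
    // comparison; 13 digits covers epoch milliseconds until the year 2286.
    public static String toSortableString(Date createDate) {
        return String.format("%013d", createDate.getTime());
    }
}

Rounding the value down (for example, to whole minutes) before encoding also shrinks the number of distinct terms the FieldCache has to hold.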

Licensed under: CC-BY-SA with attribution