Question

I use Solr (SolrCloud) to index and search my tweets. There are about 16 million tweets, and the index is approximately 3 GB. Tweets are indexed in real time as they arrive so that real-time search is possible. Currently I use the lowercase field type for the tweet body field. A search with a single term takes around 7 seconds, and the time increases linearly with each additional search term. The Solr process is allocated a maximum of 3 GB of RAM. A sample Solr search query looks like this:

tweet_body:*big* AND tweet_body:*data* AND tweet_tag:big_data

Any suggestions on improving the speed of searching? Currently I run only 1 shard which contains the entire tweet collection.


Solution

The query tweet_body:*big* can be expected to perform poorly. Trailing wildcards are easy, and leading wildcards can be readily handled with a ReversedWildcardFilterFactory. A term with both a leading and a trailing wildcard, however, cannot use the index to locate matching documents directly and has to scan every term in the field instead. Combining the two approaches would only allow you to search:

tweet_body:*big tweet_body:big*

Which is not the same thing. If you really must search for terms with a leading AND trailing wildcard, I would recommend looking into indexing your data as N-grams.
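
To illustrate why N-grams help, here is a minimal Python sketch of the idea (not Solr's implementation). The helper names char_ngrams and build_index and the sample documents are made up for this example. Each token is broken into character N-grams at index time, so a substring search becomes a plain dictionary lookup instead of a scan. In Solr itself this is typically configured in the field type's analyzer, for example with an N-gram filter or tokenizer.

```python
def char_ngrams(token, min_n=3, max_n=4):
    """Return the set of character n-grams of `token` with lengths min_n..max_n."""
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


def build_index(docs, min_n=3, max_n=4):
    """Map every n-gram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for gram in char_ngrams(token, min_n, max_n):
                index.setdefault(gram, set()).add(doc_id)
    return index


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Bigger datasets arrive daily",
    2: "Ambiguous results",
    3: "Nothing relevant here",
}

index = build_index(docs)

# Equivalent of *big*: a single lookup, no scan over all terms.
matches = index.get("big", set())
```

The trade-off is index size: every token produces many grams, and a search term only matches by direct lookup if its length falls between the minimum and maximum gram sizes.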


I wasn't previously aware of it, but it seems the lowercase field type is a Lowercase filtered KeywordAnalyzer. This is not what you want. That means the entire field is treated as a single token. Good for identification numbers and the like, but not for a body of text you wish to perform a full text search on.

So yes, you need to change it. text_general is probably appropriate. That will index a correctly tokenized field, and you should be able to perform the query you are looking for with:

tweet_body:big AND tweet_body:data AND tweet_tag:big_data

You will have to reindex, but there is no avoiding that. There is no good, performant way to perform a full text search on a keyword field.
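
The difference between the two field types can be sketched in a few lines of Python. This is a simplified model, not Solr's analysis chain: keyword_analyze and word_analyze are hypothetical stand-ins for a lowercased keyword field and a tokenized text field, and the real text_general type does more than split on word characters.

```python
import re


def keyword_analyze(text):
    """Model of a lowercased keyword field: the whole value is one token."""
    return [text.lower()]


def word_analyze(text):
    """Rough model of a tokenized text field: one lowercased token per word."""
    return re.findall(r"\w+", text.lower())


tweet = "Big Data is everywhere"

keyword_tokens = keyword_analyze(tweet)  # a single token for the whole tweet
word_tokens = word_analyze(tweet)        # one token per word

# An exact term lookup for "big" only succeeds against the tokenized field.
found_in_keyword_field = "big" in keyword_tokens
found_in_text_field = "big" in word_tokens
```

With the keyword-style field, the only way to find "big" inside the tweet is a wildcard match against the whole value, which is what makes the original query slow.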

OTHER TIPS

Try using filter queries (fq): their results are cached independently of the main query and reused across searches, and they do not affect scoring.
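
As a rough sketch of what that could look like for the query in the question, the tag restriction moves out of q and into an fq parameter. The host, port, and collection name (tweets) below are assumptions for illustration; only the q and fq request parameters are standard Solr.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port and collection name.
base_url = "http://localhost:8983/solr/tweets/select"

params = [
    # Main query: the full text part of the search.
    ("q", "tweet_body:big AND tweet_body:data"),
    # Filter query: restricts the result set and can be cached separately.
    ("fq", "tweet_tag:big_data"),
]

request_url = base_url + "?" + urlencode(params)
print(request_url)
```

Filters that repeat across many searches, such as a popular tag, are the ones most likely to benefit from the cache.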

Licensed under: CC-BY-SA with attribution