Improving the speed of Solr query over 16 million tweets

Question 1

The query tweet_body:*big* can be expected to perform poorly. Trailing wildcards are easy, and leading wildcards can be readily handled with a ReversedWildcardFilterFactory. A query with both a leading and a trailing wildcard, however, has to scan every indexed value rather than using the index to locate matching documents directly. Combining the two approaches would only allow you to search:

tweet_body:*big tweet_body:big*

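That is not the same thing: it matches terms that end with big or start with big, but not terms that have big in the middle. If you really must search for terms with both a leading and a trailing wildcard, I would recommend looking into indexing your data as N-grams.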
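
To illustrate why N-grams help, here is a minimal Python sketch of the idea, not Solr's actual implementation. It indexes every character trigram of every token, so a search for "big" anywhere inside a word becomes an exact lookup instead of a scan. The function names and the sample tweets are made up for the example, and the fixed gram size of 3 is an assumption.

```python
from collections import defaultdict

def char_ngrams(token, n):
    """Return all contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def build_ngram_index(docs, min_n=3, max_n=3):
    """Map each n-gram to the set of document ids whose tokens contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for n in range(min_n, max_n + 1):
                for gram in char_ngrams(token, n):
                    index[gram].add(doc_id)
    return index

# Hypothetical sample data, for illustration only.
docs = {
    1: "Big data is everywhere",
    2: "An ambiguous tweet",
    3: "Nothing relevant here",
}
index = build_ngram_index(docs)

# The equivalent of *big* is now a single dictionary lookup.
matches = sorted(index.get("big", set()))
print(matches)
```

The trade-off is index size: every token produces many grams, so the index grows considerably in exchange for the faster lookup.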

I wasn't previously aware of it, but it seems the lowercase field type is a KeywordAnalyzer with a lowercase filter. This is not what you want: it means the entire field is treated as a single token. That is good for identification numbers and the like, but not for a body of text on which you wish to perform a full-text search.
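
The difference is easy to see in a small Python sketch. The two functions below are simplified stand-ins I wrote for illustration, not Solr's real analyzers: one mimics keyword-style analysis (the whole field as one lowercased token), the other a basic tokenizing analysis along the lines of text_general.

```python
import re

def keyword_lowercase(text):
    """Keyword-style analysis: the whole field becomes one lowercased token."""
    return [text.lower()]

def tokenized_lowercase(text):
    """Simplified tokenizing analysis: lowercase, then split into word tokens."""
    return re.findall(r"\w+", text.lower())

def term_matches(terms, term):
    """A plain term query matches only if the term equals an indexed token."""
    return term in terms

# Hypothetical tweet body, for illustration only.
tweet = "Big Data is a big deal"
keyword_terms = keyword_lowercase(tweet)
tokenized_terms = tokenized_lowercase(tweet)

print(keyword_terms)
print(tokenized_terms)
print(term_matches(keyword_terms, "big"))
print(term_matches(tokenized_terms, "big"))
```

With the keyword-style tokens, the only way to find "big" is a wildcard over the single huge token, which is exactly the slow query from the question.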

So yes, you need to change it. text_general is probably appropriate. That will index a correctly tokenized field, and you should be able to perform the query you are looking for with:

tweet_body:big AND tweet_body:data AND tweet_tag:big_data

You will have to reindex, but there is no avoiding that. There is no good, performant way to perform a full text search on a keyword field.

Question 2

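Try using filter queries (the fq parameter). Filter queries are evaluated and cached separately from the main query and do not contribute to scoring, so a filter that is reused across requests is cheap after its first use.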
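
As a rough sketch of what such a request could look like, the Python snippet below builds a select URL with the full-text part in q and the tag restriction in fq. The host, port, and core name (tweets) are placeholders I made up, and moving tweet_tag:big_data into the filter is an assumption about which clause you would reuse; q, fq, and rows are standard Solr request parameters.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, filters, rows=10):
    """Build a Solr select URL with one q parameter and one fq per filter."""
    params = [("q", query)]
    params += [("fq", f) for f in filters]
    params.append(("rows", str(rows)))
    return base_url + "/select?" + urlencode(params)

# Hypothetical host and core name, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr/tweets",
    "tweet_body:big AND tweet_body:data",
    ["tweet_tag:big_data"],
)
print(url)
```

Each fq is cached on its own, so it usually pays to put clauses that repeat across many requests into separate filters.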