SolrCloud: Effect of TF-IDF on querying with multiple shards

https://stackoverflow.com/questions/21529573

06-10-2022

Question

We have a huge set of text documents that we want to index in Solr. Since the index is far too large for a single node, we decided to split it across several shards using SolrCloud. As I understand it, whenever a search is performed, it is distributed over all the shards, and the results from all the shards are merged and returned. However, each shard searches only the index that it hosts. My question is: will this affect the quality of the search results, given that IDF, which should be calculated over the entire set of documents, will now be calculated over just the documents in a particular shard?

Solution

Solr does not calculate universal term/document frequencies; they are computed per node. For most large-scale implementations, it is unlikely to matter that Solr calculates TF-IDF at the shard level. However, if your collection is heavily skewed in its distribution across servers, you may get misleading relevancy results in your searches. In general, it is probably best to distribute documents randomly across your shards.
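
To make the effect of skew concrete, here is a minimal Python sketch, not Solr code. It assumes a classic-style IDF formula, 1 + ln(N / (df + 1)), and a made-up corpus of 1000 documents in which 100 contain the term "solr". The helper names (idf, shard_idfs) are hypothetical, and the round-robin split is only a deterministic stand-in for random distribution.

```python
import math


def idf(num_docs, doc_freq):
    """Classic-style IDF, assumed here for illustration: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def shard_idfs(shards, term):
    """Compute the IDF of `term` separately on each shard, using only that shard's documents."""
    return [
        idf(len(docs), sum(1 for doc in docs if term in doc))
        for docs in shards
    ]


# Hypothetical corpus: each document is a set of terms.
# The first 100 documents contain "solr", the remaining 900 do not.
corpus = [{"solr"} if i < 100 else {"other"} for i in range(1000)]

# IDF over the whole collection (what a single, unsharded index would see).
global_idf = idf(len(corpus), sum(1 for doc in corpus if "solr" in doc))

# Skewed split: every "solr" document lands on the first shard.
skewed = [corpus[:500], corpus[500:]]

# Even split: round-robin assignment, standing in for random distribution.
even = [corpus[0::2], corpus[1::2]]

skewed_idfs = shard_idfs(skewed, "solr")
even_idfs = shard_idfs(even, "solr")

print("global:", round(global_idf, 3))
print("skewed:", [round(v, 3) for v in skewed_idfs])
print("even:  ", [round(v, 3) for v in even_idfs])
```

With the even split, each shard's IDF stays close to the global value, so scores from different shards remain comparable when merged. With the skewed split, the same term gets a very different IDF on each shard, which is the situation that can produce misleading rankings.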

More on this here: https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding

Licensed under: CC-BY-SA with attribution
