Simple explanation of different ElasticSearch similarity algorithms

https://stackoverflow.com/questions/19423423

01-07-2022

Question

I am looking into the different similarity algorithms which define how the score of each document is computed during search. The available algorithms are listed here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-similarity.html

My problem is that I have trouble understanding them from the Wikipedia articles or the class descriptions in the Lucene API documentation. I really like the answer explaining the TF/IDF similarity algorithm (the default in ElasticSearch) here: What is the reasoning behind the ranking of this ElasticSearch query? (so this one I understand to some extent).

Can somebody provide similarly simple explanations of the other algorithms listed there? These include:

bm25 similarity
dfr similarity
ib similarity

Thank you in advance.

Solution

The difficulty here is that, going by the description given in the linked answer, Lucene's default similarity and bm25 are fundamentally the same, in that they both factor in:

more occurrences in the document are preferred
terms rarer in the corpus are preferred
shorter documents are more heavily weighted
other functions used to adjust score, boosts, etc.
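To make the shared shape concrete, here is a minimal Python sketch of a single-term score in both styles. It is a simplification under stated assumptions, not Lucene's actual code: boosts, query normalization and multi-term coordination are left out, the function names are made up for this illustration, and k1=1.2 and b=0.75 are simply commonly quoted BM25 defaults.

```python
import math

def classic_tfidf(freq, doc_len, doc_freq, num_docs):
    """Simplified single-term score in the style of Lucene's classic TF/IDF."""
    tf = math.sqrt(freq)                               # more occurrences score higher, with diminishing returns
    idf = 1.0 + math.log(num_docs / (doc_freq + 1.0))  # terms rarer in the corpus score higher
    length_norm = 1.0 / math.sqrt(doc_len)             # shorter documents are weighted more heavily
    return tf * idf * idf * length_norm                # idf enters squared in the classic formula

def bm25(freq, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Simplified single-term BM25 score (k1 and b values are assumed defaults)."""
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_factor = 1.0 - b + b * doc_len / avg_doc_len  # > 1 for longer-than-average documents
    tf = freq * (k1 + 1.0) / (freq + k1 * length_factor)  # saturates at k1 + 1 as freq grows
    return idf * tf

if __name__ == "__main__":
    for freq in (1, 2, 5, 50):
        print(freq,
              round(classic_tfidf(freq, 100, 10, 1000), 3),
              round(bm25(freq, 100, 100, 10, 1000), 3))
```

Both functions reward term frequency, rarity and short documents. The visible difference in this sketch is that the BM25 term-frequency factor levels off, while the square root in the classic formula keeps growing.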

dfr actually encompasses 7 different basic models, each using a different scoring algorithm, followed by two highly configurable normalization steps. Many of the possible configurations fit the very general steps above; some diverge from them.
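As a rough illustration of that "basic model plus two normalization steps" structure, here is a minimal Python sketch of one possible combination (an inverse-document-frequency basic model, a Laplace after-effect, and an H2-style length normalization, a combination usually written InL2 in the DFR literature). The function names and the pluggable-argument design are assumptions made for this example; this is not how Lucene organizes its classes.

```python
import math

def normalization_h2(tf, doc_len, avg_doc_len, c=1.0):
    """Length normalization: rescale tf as if the document had average length."""
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)

def basic_model_in(tfn, doc_freq, num_docs):
    """Basic model: information content based on inverse document frequency."""
    return tfn * math.log2((num_docs + 1.0) / (doc_freq + 0.5))

def after_effect_laplace(tfn):
    """After-effect (first normalization): dampens the gain from repeated occurrences."""
    return 1.0 / (tfn + 1.0)

def dfr_score(tf, doc_len, avg_doc_len, doc_freq, num_docs,
              basic_model=basic_model_in,
              after_effect=after_effect_laplace,
              normalization=normalization_h2):
    """Single-term DFR-style score; each of the three pieces can be swapped out."""
    tfn = normalization(tf, doc_len, avg_doc_len)
    return basic_model(tfn, doc_freq, num_docs) * after_effect(tfn)

if __name__ == "__main__":
    print(dfr_score(tf=3, doc_len=100, avg_doc_len=100, doc_freq=10, num_docs=1000))
```

With this particular combination the score still rises with term frequency and term rarity and falls with document length; swapping in a different basic model or after-effect is where configurations can start to diverge from that pattern.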

Similarly, ib allows significant configuration as well, but generally hits the same high points: favoring higher term frequency, favoring matches on rarer terms (by some measure), and adjusting for document length, boosts, and other possible normalizations.
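For a feel of how an information-based score hits those same points, here is a minimal Python sketch of one configuration: a log-logistic distribution, with the term's document frequency ratio as the lambda parameter and an H2-style length normalization. The function name and the parameter c are assumptions for this illustration, and boosts are left out.

```python
import math

def ib_score_ll(tf, doc_len, avg_doc_len, doc_freq, num_docs, c=1.0):
    """Single-term information-based score with a log-logistic distribution (sketch)."""
    # Length-normalized term frequency: shorter documents get a larger tfn.
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)
    # lambda: fraction of documents containing the term; small for rare terms.
    lam = doc_freq / num_docs
    # Information content of seeing the term at least tfn times.
    return math.log((tfn + lam) / lam)

if __name__ == "__main__":
    print(ib_score_ll(tf=3, doc_len=100, avg_doc_len=100, doc_freq=10, num_docs=1000))
```

A rarer term gives a smaller lambda and therefore a larger score, while a higher (normalized) term frequency increases the score with diminishing returns.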

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow