Question

I am looking into the different similarity algorithms which define how the score of each document is computed during search. The available algorithms are listed here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-similarity.html

My problem is that I have trouble understanding them when digging through the Wikipedia articles or the class descriptions in the Lucene API documentation. I really like the answer explaining the TF/IDF similarity algorithm (the default in ElasticSearch) here: What is the reasoning behind the ranking of this ElasticSearch query? (so this one I understand to a certain extent).

Can somebody provide similarly simple explanations of the other algorithms outlined there? These include:

  • bm25 similarity
  • dfr similarity
  • ib similarity

Thank you in advance.

Solution

The difficulty here is that, at the level of description used in the linked answer, Lucene's default similarity and bm25 are fundamentally identical. Both take the following into account:

  • more occurrences in the document are preferred
  • terms rarer in the corpus are preferred
  • shorter documents are more heavily weighted
  • other functions used to adjust score, boosts, etc.
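
To make the shared factors above concrete, here is a minimal sketch that scores a single term in a single document both ways. It is a simplification, not Lucene's actual code: the function names are made up for illustration, the TF/IDF version leaves out query normalization, coordination, and boosts, and the BM25 version uses the commonly cited defaults k1 = 1.2 and b = 0.75.

```python
import math


def classic_tfidf(tf, df, n_docs, doc_len):
    """Simplified TF/IDF score for one term (illustrative, not Lucene's exact formula)."""
    term_freq = math.sqrt(tf)                   # more occurrences score higher
    idf = 1.0 + math.log(n_docs / (df + 1.0))   # rarer terms score higher
    length_norm = 1.0 / math.sqrt(doc_len)      # shorter documents score higher
    return term_freq * idf * length_norm


def bm25(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Simplified BM25 score for one term (k1 and b are assumed defaults)."""
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))  # rarer terms score higher
    # Document length is measured relative to the average length in the corpus.
    length_ratio = 1.0 - b + b * doc_len / avg_doc_len
    # The term-frequency part grows with tf but levels off as tf gets large.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_ratio)


if __name__ == "__main__":
    for tf in (1, 2, 5, 50):
        print(tf,
              round(classic_tfidf(tf, df=10, n_docs=1000, doc_len=100), 3),
              round(bm25(tf, df=10, n_docs=1000, doc_len=100, avg_doc_len=100), 3))
```

Both functions reward the same three things. The visible difference in this sketch is in how they do it: the TF/IDF score keeps growing with the square root of the term frequency, while the BM25 score flattens out, and BM25 judges document length against the corpus average rather than in absolute terms.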

dfr alone encompasses 7 different base models, each using a different scoring algorithm, followed by two highly configurable normalization steps. Many of its configurations fit the very general steps above; some diverge from them.

Similarly, ib allows significant configuration, but generally hits the same high points: favoring higher term frequency, favoring matches on terms that are rarer (by some measure), and adjusting for document length, boosts, and other possible normalizations.
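
To give a feel for how configurable dfr and ib are, here is a sketch of what custom similarity settings for an index might look like, written as a Python dictionary and printed as JSON. The similarity names my_dfr and my_ib are hypothetical, and the parameter names and values are assumptions based on the similarity module documentation linked in the question; the accepted values differ between versions, so check them against the documentation for your release.

```python
import json

# Hypothetical index settings defining two custom similarities.
# Parameter names and values are assumptions; verify them for your version.
index_settings = {
    "index": {
        "similarity": {
            "my_dfr": {
                "type": "DFR",
                "basic_model": "g",          # one of the base models
                "after_effect": "l",         # first normalization step
                "normalization": "h2",       # second (length) normalization step
                "normalization.h2.c": "3.0",
            },
            "my_ib": {
                "type": "IB",
                "distribution": "ll",
                "lambda": "df",
                "normalization": "h2",
            },
        }
    }
}

if __name__ == "__main__":
    # Print the settings as JSON, e.g. to use as the body of an index creation request.
    print(json.dumps(index_settings, indent=2))
```

Each choice of base model, after effect, and normalization yields a different formula, which is why a single simple explanation of dfr or ib is hard to give beyond the general points above.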

Licensed under: CC-BY-SA with attribution