Difference between fuzzy like this and more like this?

https://stackoverflow.com/questions/19364756

30-06-2022

Question

What is the difference between Lucene's MoreLikeThis (mlt) and FuzzyQuery (flt)?

I am evaluating both query types through Elasticsearch (ES) and I found they are conceptually very similar:

mlt: compares an existing document's fields with other documents' fields
flt: compares a string with other documents' fields

However, the flt query seems to be about an order of magnitude slower than the mlt query.

I'm using the latest ES, which in turn uses Lucene 4.5.

From the fuzzy like this docs:

Fuzzifies ALL terms provided as strings and then picks the best n differentiating terms. In effect this mixes the behaviour of FuzzyQuery and MoreLikeThis but with special consideration of fuzzy scoring factors. This generally produces good results for queries where users may provide details in a number of fields and have no knowledge of boolean query syntax and also want a degree of fuzzy matching and a fast query.

For each source term the fuzzy variants are held in a BooleanQuery with no coord factor (because we are not looking for matches on multiple variants in any one doc). Additionally, a specialized TermQuery is used for variants and does not use that variant term’s IDF because this would favor rarer terms, such as misspellings. Instead, all variants use the same IDF ranking (the one for the source query term) and this is factored into the variant’s boost. If the source query term does not exist in the index the average IDF of the variants is used.
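The IDF rule quoted above can be sketched in a few lines. This is only a toy illustration, not Lucene's code: the names `idf` and `variant_idf` and the `doc_freqs` dictionary are hypothetical, and the formula is assumed to be the classic Lucene one, 1 + ln(numDocs / (docFreq + 1)).

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style IDF (assumed): 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def variant_idf(source_term, variants, doc_freqs, num_docs):
    """Return the single IDF value shared by every fuzzy variant of source_term.

    doc_freqs maps each indexed term to its document frequency (hypothetical
    stand-in for the index statistics).
    """
    source_df = doc_freqs.get(source_term, 0)
    if source_df > 0:
        # The source term is in the index: every variant reuses its IDF,
        # so a rare misspelling is not favored over the common spelling.
        return idf(source_df, num_docs)
    # The source term is not in the index: fall back to the average IDF
    # of the variants that are.
    present = [v for v in variants if doc_freqs.get(v, 0) > 0]
    if not present:
        return 0.0
    return sum(idf(doc_freqs[v], num_docs) for v in present) / len(present)
```

With `doc_freqs = {"lucene": 99, "lucine": 1}`, the rare misspelling "lucine" would get a much higher IDF on its own than "lucene", which is exactly the bias the shared value is meant to avoid.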

Solution

You are comparing the "more like this" query with the "fuzzy like this" query. Although the latter adds some fuzziness to "more like this", it is not the same thing as the plain fuzzy query, even though it uses the fuzzy query underneath.

The "more like this" query lets you specify a like_text and a list of fields, and returns documents that contain that text in the specified fields. You can tune the term-frequency settings to control when documents are returned or ignored, so that you get back documents that are similar and interesting enough for your requirements.
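As a rough sketch, a "more like this" request body could be built like this. The helper name `more_like_this_query` and the example values are hypothetical. The parameter names (`fields`, `like_text`, `min_term_freq`, `max_query_terms`) are those of the Elasticsearch versions of that era as far as I recall, so check them against the docs for your version.

```python
import json


def more_like_this_query(like_text, fields, min_term_freq=1, max_query_terms=12):
    """Build a request body for a more_like_this query (hypothetical helper).

    The threshold values here are illustrative, not recommended settings.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": fields,                    # fields to compare against
                "like_text": like_text,              # free text to find similar docs for
                "min_term_freq": min_term_freq,      # ignore terms rarer than this in like_text
                "max_query_terms": max_query_terms,  # cap on the terms that are selected
            }
        }
    }


if __name__ == "__main__":
    body = more_like_this_query("lucene fuzzy search", ["title", "body"])
    print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a search request. The frequency thresholds are the knobs mentioned above for deciding which documents come back.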

The "fuzzy like this" query has a similar structure: it is in effect a "more like this" query that also uses a fuzzy query internally to find similar documents. The returned documents will therefore contain not only the terms you requested in the like_text, but also similar terms, matched with some fuzziness. It is slower because of the fuzzy query, which is more expensive, although its performance improved a lot with Lucene 4.x.
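To see where the extra cost comes from, here is a toy sketch of fuzzy term expansion. It is a brute-force illustration under simplifying assumptions, not how Lucene does it: the names `edit_distance` and `expand_terms` are hypothetical, the vocabulary is a plain Python set, and every query term is compared against every indexed term.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def expand_terms(like_text, vocabulary, max_edits=1):
    """Map each term of like_text to the indexed terms within max_edits of it.

    An exact query needs one lookup per term. This toy version scans the whole
    vocabulary per term, and each term can turn into several query clauses.
    """
    return {
        term: sorted(v for v in vocabulary if edit_distance(term, v) <= max_edits)
        for term in like_text.lower().split()
    }


if __name__ == "__main__":
    vocab = {"lucene", "lucine", "lucent", "search", "serch", "index"}
    print(expand_terms("lucene search", vocab))
```

Each source term becomes a group of variant terms, so the final query has more clauses to evaluate and score than the plain "more like this" equivalent.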

Licensed under: CC-BY-SA with attribution
