Question

I have an index with the following mapping and analyzer:

{
  settings: {
    analysis: {
      char_filter: {
        custom_cleaner: {
          # remove - and * (we don't want them here)
          type: "mapping",
          mappings: ["-=>", "*=>"]
        }
      },
      analyzer: {
        custom_ngram: {
          tokenizer: "standard",
          filter: [ "lowercase", "custom_ngram_filter" ],
          char_filter: ["custom_cleaner"]
        }
      },
      filter: {
        custom_ngram_filter: {
          type: "nGram",
          min_gram: 3,
          max_gram: 20,
          token_chars: [ "letter", "digit" ]
        }
      }
    }
  },
  mappings: {
    attributes: {
      properties: {
        name: { type: "string"},
        words: { type: "string", similarity: "BM25", analyzer: "custom_ngram" }
      }
    }
  }
}

And I have the following 2 documents in the index:

"name": "shirts", "words": [ "shirt"]

and

"name": "t-shirts", "words": ["t-shirt"]

I perform a multi_match query as follows:

"query": {

            "multi_match": {
               "query": "t-shirt",
               "fields": [
                  "words",
                  "name"
               ],
               "analyzer": "custom_ngram"
            }

   }

The question:

The shirts document gets a score of 1.17, whereas t-shirt gets a score of 0.8. Why is that, and how can I make t-shirt (the direct match) score higher?

I need ngrams for another use case, where I have to detect "contains" matches (e.g. shirt is contained in muscle-shirt), so I don't think I can drop the ngrams.

Thank you!


Solution

I believe this is happening because you are using the StandardTokenizer, which tokenizes the string "t-shirt" into the tokens "t" and "shirt". "t", however, is shorter than the minimum gram size, so no ngrams are generated from it. You therefore get the same matches in both cases, but the t-shirt document is longer, and so it scores a bit lower.
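A quick way to verify this is to run the query text through the _analyze API and see which tokens your custom analyzer actually produces (a sketch; my_index is a placeholder for your index name, and newer Elasticsearch versions expect these parameters in a JSON request body rather than the query string):

    GET /my_index/_analyze?analyzer=custom_ngram&text=t-shirt

Comparing that output with the tokens produced for each document's words field shows exactly which grams can match.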

You can get detailed information on why documents get the scores they do by using the Explain API.
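For example, you can ask for an explanation inline in the search request (again assuming a hypothetical index called my_index):

    GET /my_index/_search
    {
        "explain": true,
        "query": {
            "multi_match": {
                "query": "t-shirt",
                "fields": ["words", "name"],
                "analyzer": "custom_ngram"
            }
        }
    }

Each hit then comes back with an explanation section that breaks its score down into the individual contributing factors.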

Are you sure you need to use ngrams? Your example, "shirt" in "muscle-shirt", should be handled just fine by the StandardAnalyzer, which tokenizes on the hyphen.
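To illustrate, analyzing "muscle-shirt" with the standard analyzer already yields the separate tokens muscle and shirt, so a plain match query for shirt would find that document without any ngrams (again just a sketch against a hypothetical my_index):

    GET /my_index/_analyze?analyzer=standard&text=muscle-shirt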

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow