I believe this is happening because you are using the StandardTokenizer, which tokenizes the string "t-shirt" into the tokens "t" and "shirt". "t", however, is shorter than the min_gram size, so no ngrams are generated from it at all. As a result, both documents produce the same matching tokens, but the field containing "t-shirt" is longer, so length normalization gives it a slightly lower score.
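To see why the two documents end up with identical matching tokens, here is a rough Python simulation of the analysis chain. This is not Lucene's actual implementation, just a sketch of the behavior; the `min_gram=2, max_gram=4` settings are assumptions for illustration.

```python
import re

def standard_tokenize(text):
    # Rough stand-in for Lucene's StandardTokenizer: split on
    # non-alphanumeric characters, so "t-shirt" -> ["t", "shirt"]
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def ngrams(token, min_gram, max_gram):
    # Emit ngrams between min_gram and max_gram in length.
    # A token shorter than min_gram yields no grams at all.
    return [token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)]

def analyze(text, min_gram=2, max_gram=4):
    return [g for tok in standard_tokenize(text)
            for g in ngrams(tok, min_gram, max_gram)]

# The single-character token "t" is dropped entirely, so both
# strings analyze to exactly the same set of grams:
print(analyze("t-shirt"))
print(analyze("shirt"))
```

Since the gram sets are identical, the only remaining scoring difference between the two documents is field length.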
You can get detailed information on why documents receive the scores they do by using the Explain API.
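For example, a request like the following returns a breakdown of each scoring factor (term frequency, inverse document frequency, field-length norm) for one document; the index name `my-index`, document id `1`, and field `title` here are placeholders for your own:

```
GET /my-index/_explain/1
{
  "query": {
    "match": { "title": "shirt" }
  }
}
```

In the response, look at the `fieldNorm` (or norm) component to confirm that the length difference is what is lowering the "t-shirt" document's score.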
Are you sure you need to use ngrams at all? Your example, matching "shirt" in "muscle-shirt", should be handled just fine by the StandardAnalyzer, which will tokenize on the hyphen.
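You can verify this with the Analyze API, which shows exactly which tokens an analyzer produces:

```
GET /_analyze
{
  "analyzer": "standard",
  "text": "muscle-shirt"
}
```

This should return the two tokens "muscle" and "shirt", so a plain `match` query for "shirt" will find the document without any ngram configuration.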