How to correctly interpret Solr similarity scores?

https://stackoverflow.com/questions/21382258

03-10-2022

Question

I am aware that the similarity scores returned by Solr are relevant only for a specific query and that they have only relative meaning. Having said that, is there a way to determine the 'goodness' of a score in a global fashion?

For example: Suppose I run an MLT query and get 5 documents. Each document has a score but the fact is that the document with the highest score is not necessarily the most relevant. I want to be able to specify a threshold score below which I do not even consider the documents.

How can this threshold be determined? Is it only by empirical measurement, or can I say that a similarity score larger than 3 usually indicates good resemblance in content, while a similarity score smaller than 1 usually means the document is completely irrelevant? Or alternatively, can I say that results scoring less than 80% of the similarity of a document to itself are irrelevant?

Solution

For a given document, Solr may determine the interesting terms and their weights:

"interestingTerms": 
    ["field_b:foo",5.0,"field_b:bar",2.9085307,"field_b:baz",1.67070794]

which can be used to generate the following search query:

field_b:foo^5.0 field_b:bar^2.9085307 field_b:baz^1.67070794

So MLT is, AFAIK, a two-step process: it first finds the interesting terms and their weights for a given document, and then uses those terms to run a search.
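To make the second step concrete, here is a minimal sketch of how the flat term/weight list shown above could be turned into the boosted query string. The function name `build_boosted_query` is hypothetical, and the sketch assumes the list always alternates term and weight as in the example; it is not Solr's own code, only an illustration of the transformation.

```python
def build_boosted_query(interesting_terms):
    """Turn a flat [term, weight, term, weight, ...] list into a boosted query string.

    Hypothetical helper for illustration; assumes the alternating layout
    seen in the "interestingTerms" example above.
    """
    if len(interesting_terms) % 2 != 0:
        raise ValueError("expected alternating term/weight pairs")
    terms = interesting_terms[0::2]    # every even position is a field:term
    weights = interesting_terms[1::2]  # every odd position is its weight
    return " ".join(f"{term}^{weight}" for term, weight in zip(terms, weights))


# Values copied from the example response above.
interesting_terms = [
    "field_b:foo", 5.0,
    "field_b:bar", 2.9085307,
    "field_b:baz", 1.67070794,
]
query = build_boosted_query(interesting_terms)
print(query)
```

Because the weights depend on the source document's own term statistics, the resulting scores are only comparable within that one query, which is why a single global threshold is hard to justify.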

See https://stackoverflow.com/a/12328229/604511 and mlt.interestingTerms in http://wiki.apache.org/solr/MoreLikeThisHandler .

Do you have a good reason for such a threshold? Just present the results to the user. If there is low similarity, the user will (and must be allowed to) overlook the results.

As an example, consider StackOverflow's own related-question suggestions (screenshot not preserved): they concentrate on the terms "why does" and fetch nothing about "tomcat". Still, SO users overlook bad MLT suggestions all the time.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow