Question

I am trying to compare two documents in Solr (say Doc A and Doc B) based on a common "name" field, using a Solr query. When I query with A.name, I get document B in the results with a relevancy score of, say, SCR1. If I do it the other way round, i.e. query with B.name, I get document A somewhere in the results, but this time the score of A is not the same SCR1.

I believe this happens because the number of terms in A.name and B.name is different, so the similarity score is not the same. Is that the reason for the difference?

Is there any way I can get the same score either way (as described above)?
Is it not possible to compare the scores of any two queries? Is it possible to do this with the native Lucene APIs?

Solution

To answer your second question: scores produced by two different queries should not be compared.

A similar question was posted in the java-users lucene mailing list.

Here's a link to it: Compare scores across queries

It explains why one should not do that.

OTHER TIPS

I'm not quite sure I'm clear on the queries you are referring to, but let's say the situation is something like this:

  • Doc A: Name = "Carlos Fernando Luís Maria Víctor Miguel Rafael Gabriel Gonzaga Xavier Francisco de Assis José Simão de Bragança, Sabóia Bourbon e Saxe-Coburgo-Gotha"

  • Doc B: Name = "Tomás António Gonzaga"

If you search for "gonzaga", Doc B will be given the higher score: there is one match in each name, but Doc B has a much shorter name (only three terms), and shorter fields are weighted more heavily. This is the lengthNorm referred to in the TFIDFSimilarity documentation.
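To make the length effect concrete, here is a minimal sketch of the classic normalization, 1 / sqrt(number of terms in the field). It is not Lucene code: the class LengthNormSketch, its methods, and the crude tokenizer are hypothetical stand-ins, and the names are written without accents purely to keep the source ASCII-only.

```java
import java.util.Locale;

// Hypothetical helper, not Lucene code: classic length normalization only.
final class LengthNormSketch {
    // Accents folded to ASCII so the file does not depend on source encoding.
    static final String DOC_A = "Carlos Fernando Luis Maria Victor Miguel Rafael Gabriel Gonzaga "
            + "Xavier Francisco de Assis Jose Simao de Braganca, Saboia Bourbon e Saxe-Coburgo-Gotha";
    static final String DOC_B = "Tomas Antonio Gonzaga";

    // Crude stand-in for an analyzer: lowercase, then split on non-letters.
    static int countTerms(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z]+").length;
    }

    // Classic TF-IDF length normalization: 1 / sqrt(number of terms in the field).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        for (String name : new String[] {DOC_A, DOC_B}) {
            int n = countTerms(name);
            System.out.printf("%d terms -> lengthNorm %.3f%n", n, lengthNorm(n));
        }
    }
}
```

With these inputs Doc A has 23 terms and Doc B has 3, so Doc B's norm (1/sqrt(3)) is larger than Doc A's (1/sqrt(23)). Lucene also stores the norm in a compressed form, so the values it actually uses are approximations of these.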

There are other factors though. If we just chuck each name into the queryparser, and see what comes up, something like:

Query queryA = queryparser.parse(docA.name);
Query queryB = queryparser.parse(docB.name);

Then the queries generated are very different:

name:carlos name:fernando name:luis name:maria name:victor name:miguel name:rafael name:gabriel name:gonzaga name:xavier name:francisco name:de name:assis name:jose name:simao name:de name:braganca name:saboia name:bourbon name:e name:saxe name:coburgo name:gotha

vs

name:tomas name:antonio name:gonzaga

There are a wealth of reasons why these would generate different scores: the lengthNorm discussed above; the coord factor, which boosts results that match more of the query terms and would very likely come into play; tf, which weights documents with more matches for a term more heavily; idf, which favors terms that appear less frequently across the entire index; and so on.
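The asymmetry can be reproduced with a toy version of the classic TF-IDF formula. This is a simplified sketch, not Lucene's implementation: it keeps only coord, tf, idf and lengthNorm, leaves out queryNorm, boosts and the compressed norm encoding, and assumes an index containing just the two names. The class ToyTfIdf and all of its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy re-implementation of a simplified classic TF-IDF score (not Lucene code).
final class ToyTfIdf {
    // Accents folded to ASCII so the file does not depend on source encoding.
    static final String DOC_A = "Carlos Fernando Luis Maria Victor Miguel Rafael Gabriel Gonzaga "
            + "Xavier Francisco de Assis Jose Simao de Braganca, Saboia Bourbon e Saxe-Coburgo-Gotha";
    static final String DOC_B = "Tomas Antonio Gonzaga";

    // Crude stand-in for an analyzer: lowercase, then split on non-letters.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z]+"));
    }

    // idf(t) = 1 + ln(numDocs / (docFreq + 1))
    static double idf(String term, List<List<String>> index) {
        long docFreq = index.stream().filter(d -> d.contains(term)).count();
        return 1.0 + Math.log((double) index.size() / (docFreq + 1));
    }

    // score = coord * sum over matching query terms of tf * idf^2 * lengthNorm
    static double score(List<String> query, List<String> doc, List<List<String>> index) {
        double lengthNorm = 1.0 / Math.sqrt(doc.size());
        int matched = 0;
        double sum = 0.0;
        for (String term : query) {
            long freq = doc.stream().filter(term::equals).count();
            if (freq == 0) {
                continue; // this query term does not occur in the document
            }
            matched++;
            double tf = Math.sqrt(freq);
            double idf = idf(term, index);
            sum += tf * idf * idf * lengthNorm;
        }
        double coord = (double) matched / query.size(); // fraction of query terms matched
        return coord * sum;
    }

    public static void main(String[] args) {
        List<String> a = tokenize(DOC_A);
        List<String> b = tokenize(DOC_B);
        List<List<String>> index = Arrays.asList(a, b);
        System.out.println("query A.name, score of Doc B: " + score(a, b, index));
        System.out.println("query B.name, score of Doc A: " + score(b, a, index));
    }
}
```

Only "gonzaga" matches in both directions, with the same tf and idf, yet the two scores differ: querying with A.name matches 1 of 23 query terms against a 3-term field, while querying with B.name matches 1 of 3 query terms against a 23-term field, so coord and lengthNorm do not cancel out.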

Scores are only meaningful within the result set of a single query run. A change to the query or to the state of the index can lead to different scores, and they are not intended to be comparable. You can use IndexSearcher.explain to understand how a score was calculated.

Licensed under: CC-BY-SA with attribution