Solr: fieldNorm different per document, with no document boost

https://stackoverflow.com/questions/3102895

29-09-2019

Question

I want my search results ordered by score, which they are, but the score is not being calculated the way I expect and I'm not sure why. My goal is to remove whatever is changing the score.

If I perform a search that matches two objects (where ObjectA is expected to score higher than ObjectB), ObjectB is returned first.

Let's say, for this example, that my query is a single term: "apples".

ObjectA's title: "apples are apples" (2/3 terms)
ObjectA's description: "There were apples in the apples-apples and now the apples went all apples all over the apples!" (6/18 terms)
ObjectB's title: "apples are great" (1/3 terms)
ObjectB's description: "There were apples in the apples-room and now the apples went all bad all over the apples!" (4/18 terms)

The title field has no boost (or rather, a boost of 1) and the description field has a boost of 0.8. I have not specified a document boost in solrconfig.xml or in the query I'm sending. If there is another way to specify a document boost, I may be missing it.

Judging by the explain printout, ObjectA would score higher than ObjectB, just like I want, except for one difference: ObjectB's title fieldNorm is always higher than ObjectA's (1.0 versus 0.625 in the output below).

Here follows the explain printout. Just so you know: the title field is mditem5_tns and the description field is mditem7_tns:

ObjectB:
1.3327172 = (MATCH) sum of:
  1.0352166 = (MATCH) max plus 0.1 times others of:
    0.9766194 = (MATCH) weight(mditem5_tns:appl in 0), product of:
      0.53929156 = queryWeight(mditem5_tns:appl), product of:
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.2977981 = queryNorm
      1.8109303 = (MATCH) fieldWeight(mditem5_tns:appl in 0), product of:
        1.0 = tf(termFreq(mditem5_tns:appl)=1)
        1.8109303 = idf(docFreq=3, maxDocs=9)
        1.0 = fieldNorm(field=mditem5_tns, doc=0)
    0.58597165 = (MATCH) weight(mditem7_tns:appl^0.8 in 0), product of:
      0.43143326 = queryWeight(mditem7_tns:appl^0.8), product of:
        0.8 = boost
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.2977981 = queryNorm
      1.3581977 = (MATCH) fieldWeight(mditem7_tns:appl in 0), product of:
        2.0 = tf(termFreq(mditem7_tns:appl)=4)
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.375 = fieldNorm(field=mditem7_tns, doc=0)
  0.2975006 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(lastmodified)))+1000.0)), product of:
    0.999001 = 1000.0/(1.0*float(1)+1000.0)
    1.0 = boost
    0.2977981 = queryNorm

ObjectA:
1.2324848 = (MATCH) sum of:
  0.93498427 = (MATCH) max plus 0.1 times others of:
    0.8632177 = (MATCH) weight(mditem5_tns:appl in 0), product of:
      0.53929156 = queryWeight(mditem5_tns:appl), product of:
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.2977981 = queryNorm
      1.6006513 = (MATCH) fieldWeight(mditem5_tns:appl in 0), product of:
        1.4142135 = tf(termFreq(mditem5_tns:appl)=2)
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.625 = fieldNorm(field=mditem5_tns, doc=0)
    0.7176658 = (MATCH) weight(mditem7_tns:appl^0.8 in 0), product of:
      0.43143326 = queryWeight(mditem7_tns:appl^0.8), product of:
        0.8 = boost
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.2977981 = queryNorm
      1.6634457 = (MATCH) fieldWeight(mditem7_tns:appl in 0), product of:
        2.4494898 = tf(termFreq(mditem7_tns:appl)=6)
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.375 = fieldNorm(field=mditem7_tns, doc=0)
  0.2975006 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(lastmodified)))+1000.0)), product of:
    0.999001 = 1000.0/(1.0*float(1)+1000.0)
    1.0 = boost
    0.2977981 = queryNorm
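
The arithmetic in the two explain printouts can be replayed to see how much the title fieldNorm matters. The following is a minimal sketch, not Solr code: it copies the idf, queryNorm, tie-breaker (0.1), and function-query values from the output above, assumes tf is the square root of the term frequency (which matches the tf lines shown), and uses hypothetical helper names `clause_weight` and `total_score`.

```python
import math

# Constants copied from the explain output above.
IDF = 1.8109303
QUERY_NORM = 0.2977981
TIE = 0.1                    # "max plus 0.1 times others"
FUNCTION_SCORE = 0.2975006   # FunctionQuery contribution, identical for both docs


def clause_weight(term_freq, field_norm, boost=1.0):
    """weight = queryWeight * fieldWeight for one field clause."""
    query_weight = boost * IDF * QUERY_NORM
    field_weight = math.sqrt(term_freq) * IDF * field_norm
    return query_weight * field_weight


def total_score(title_tf, title_norm, desc_tf, desc_norm):
    """Combine title and description clauses the way the printout does."""
    clauses = [
        clause_weight(title_tf, title_norm),            # mditem5_tns
        clause_weight(desc_tf, desc_norm, boost=0.8),   # mditem7_tns^0.8
    ]
    best = max(clauses)
    return best + TIE * (sum(clauses) - best) + FUNCTION_SCORE


score_b = total_score(1, 1.0, 4, 0.375)      # ObjectB as indexed
score_a = total_score(2, 0.625, 6, 0.375)    # ObjectA as indexed
# Hypothetical: ObjectA with the same title fieldNorm as ObjectB.
score_a_equal_norm = total_score(2, 1.0, 6, 0.375)

print(score_b, score_a, score_a_equal_norm)
```

The first two values reproduce the totals in the printouts. The third suggests that, if both titles had the same fieldNorm, ObjectA would rank above ObjectB, so the title fieldNorm is the only factor reversing the order.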

Solution

The problem is caused by the stemmer. It expands "apples are apples" to "apples appl are apples appl", which makes the field longer. Because document B contains only one term that the stemmer expands, its field stays shorter than document A's.

This results in different fieldNorms.

OTHER TIPS

fieldNorm is computed from three components: the index-time boost on the field, the index-time boost on the document, and the field length. Assuming that you are not supplying any index-time boost, the difference must come from the field length.

Since lengthNorm is higher for shorter field values, for B to have a higher title fieldNorm than A, B's title must contain fewer tokens than A's.
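
To make the relationship between token count and fieldNorm concrete, here is a rough sketch under stated assumptions: lengthNorm is taken to be 1/sqrt(number of tokens), as in Lucene's default similarity of that era, and the stored norm is assumed to be truncated to a coarse one-byte float with four representable steps per power of two. The functions `quantize_norm` and `field_norm` are hypothetical names for illustration, not Lucene APIs, and the sketch ignores the range limits of the real encoding.

```python
import math


def quantize_norm(value):
    """Truncate a positive float to steps of 1.00, 1.25, 1.50, 1.75 per power of two.

    This imitates the precision loss of a one-byte norm encoding (assumption).
    """
    mantissa, exponent = math.frexp(value)    # value == mantissa * 2**exponent, 0.5 <= mantissa < 1
    scaled = mantissa * 2                     # shift the mantissa into [1, 2)
    truncated = math.floor(scaled * 4) / 4    # keep two fraction bits, rounding down
    return truncated * 2.0 ** (exponent - 1)


def field_norm(num_tokens, field_boost=1.0, doc_boost=1.0):
    """fieldNorm = document boost * field boost * lengthNorm, then quantized."""
    length_norm = 1.0 / math.sqrt(num_tokens)
    return quantize_norm(doc_boost * field_boost * length_norm)


for n in (1, 2, 3, 7):
    print(n, field_norm(n))
```

Under these assumptions, and with no index-time boosts, fewer tokens always give an equal or higher fieldNorm. The number of tokens actually counted depends on the analyzer chain (stemming, stop-word removal, and so on), so it can differ from the number of words visible in the original text.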

See the following pages for a detailed explanation of Lucene scoring:

http://lucene.apache.org/java/2_4_0/scoring.html
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html

Licensed under: CC-BY-SA with attribution
