Solr: boost relevant documents with more content

https://stackoverflow.com/questions/13167689

24-07-2021

Question

I have documents containing only a few words or sentences, and documents containing a lot of text. When the user searches for something, Solr ranks the documents with the least text as most relevant and puts those with the most text at the end. From the user's point of view, though, relevance should work differently: the top results should be relevant, but they should also contain more text, because the user wants something to read.

So how can I get relevant docs first, but with longer documents ranked ahead of those with only a few words? I am using a single text field and searching inside it.

Solution

The DefaultSimilarity class used by Lucene has a lengthNorm calculation in its scoring algorithm that boosts documents with less content over those with more content.
It is based on the number of terms in the field.
You can easily extend the Similarity class to provide a custom lengthNorm implementation that makes the number-of-terms calculation ineffective.
This class can then be specified in the schema.xml for the core to use. Because norms are computed at index time, you need to reindex after changing it.
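To see why short documents win by default, here is a minimal, Lucene-free sketch of the arithmetic. It assumes the classic length norm of 1/sqrt(numTerms) and ignores field boosts and the lossy one-byte encoding Lucene applies when it stores norms. The class and method names (LengthNormDemo, defaultLengthNorm, flatLengthNorm) are made up for illustration and are not part of Lucene.

```java
// Illustrative only: mirrors the arithmetic of the classic length norm,
// not the actual Lucene Similarity API.
class LengthNormDemo {

    // Classic length norm: shrinks as the field gets longer.
    static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // What a custom Similarity could return so that length no longer matters.
    static double flatLengthNorm(int numTerms) {
        return 1.0;
    }

    public static void main(String[] args) {
        int[] lengths = {4, 100, 2500};
        for (int n : lengths) {
            System.out.println(n + " terms: default=" + defaultLengthNorm(n)
                    + " flat=" + flatLengthNorm(n));
        }
    }
}
```

With the default formula, a 4-term field gets a norm of 0.5 while a 2500-term field gets 0.02, so the short document is favored by a factor of 25 before term frequency is even considered. A flat norm removes that penalty but does not by itself reward longer documents.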

Another option is SweetSpotSimilarity, which "computes to a constant norm for all lengths in the [min,max] range (the 'sweet spot'), and smaller norm values for lengths out of this range. Documents shorter or longer than the sweet spot range are 'punished'."

The defaults for min and max are both 1, so out of the box it will not help you. Try setting the values, e.g.:

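 <similarity class="solr.SweetSpotSimilarityFactory">
   <!-- illustrative values only; assumes Solr 4.0+ where this factory is available -->
   <int name="lengthNormMin">100</int>
   <int name="lengthNormMax">1000</int>
   <float name="lengthNormSteepness">0.5</float>
 </similarity>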
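The effect of the min and max settings can be checked with a small standalone sketch. It assumes the length norm formula documented for SweetSpotSimilarity, 1/sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1), and again ignores norm encoding. The class name, method name, and the sample values 100 and 1000 are hypothetical choices for illustration, not recommendations.

```java
// Illustrative only: reproduces the documented sweet-spot length norm formula
// in plain Java, without depending on Lucene.
class SweetSpotNormDemo {

    static double sweetSpotNorm(int numTerms, int min, int max, double steepness) {
        // Distance is 0 anywhere inside [min, max] and grows outside it.
        double distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return 1.0 / Math.sqrt(steepness * distance + 1.0);
    }

    public static void main(String[] args) {
        int[] lengths = {4, 500, 5000};
        for (int n : lengths) {
            System.out.println(n + " terms: defaults(1,1)=" + sweetSpotNorm(n, 1, 1, 0.5)
                    + " plateau(100,1000)=" + sweetSpotNorm(n, 100, 1000, 0.5));
        }
    }
}
```

With min = max = 1 and steepness 0.5, the formula collapses to 1/sqrt(numTerms), which is why the defaults change nothing. With a plateau of 100 to 1000 terms, a 500-term field gets the full norm of 1.0, while a 4-term field drops to roughly 0.10, so very short documents are now the ones penalized.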

Licensed under: CC-BY-SA with attribution
