Solr: character proximity ranks misspellings higher because of inverse document frequency

https://stackoverflow.com/questions/20893293

solr
solr4

23-09-2022

Question

I'm using character proximity to allow for some misspellings, for example:

text:manager~1

This allows both 'manager' and 'managre' to be matched. The problem is, the misspellings are always ranked higher than the proper spelling because there are fewer of those in the index. For example, let's say I have 3 documents as follows:

1) text:manager
2) text:manager
3) text:managre

Then the character proximity query above will give an inverse document frequency (idf) of 1.7 to 'managre' and 1.2 to 'manager', effectively ranking the misspelled 'managre' higher. From a technical perspective this makes sense (there are fewer occurrences of 'managre' than 'manager'), but it is not what a user would expect. Is there a way to get Solr to set the idf of misspelled words to match that of the correct spelling?

Solution

The short answer is no. The longer answer is that you still have good options; you just need to solve this in a different way.

To begin with, take advantage of query-time boosting. You can write a query such as:

text:manager^1.2 OR text:manager~1^0.8

Here you are saying: my user is smart, so I will give the query as typed a higher boost, but just in case, I will also match its variants with a slightly lower boost. Combine the exact-match clause (higher boost) with the fuzzy clause (lower boost) in a Boolean OR so that exact matches rank higher. A document containing the exact term matches both clauses, while a misspelled one matches only the fuzzy clause. Do not worry about the extra work for Solr; it is built to handle very complex Lucene query trees. Combining queries to get the expected relevancy ranking is common practice.
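If you build the query from application code, the pattern above can be sketched roughly as follows. This is only an illustration: the helper name `boosted_fuzzy_query`, the host, and the core name `mycore` are hypothetical, and the sketch assumes the term contains no Lucene special characters that would need escaping.

```python
from urllib.parse import urlencode


def boosted_fuzzy_query(field, term, exact_boost=1.2, fuzzy_boost=0.8, max_edits=1):
    """Build `field:term^exact OR field:term~N^fuzzy` (hypothetical helper).

    Assumes `term` has no Lucene special characters; escape it first otherwise.
    """
    exact = f"{field}:{term}^{exact_boost}"              # exact match, higher boost
    fuzzy = f"{field}:{term}~{max_edits}^{fuzzy_boost}"  # fuzzy match, lower boost
    return f"{exact} OR {fuzzy}"


q = boosted_fuzzy_query("text", "manager")

# Ask Solr to return the score so the effect of the boosts can be inspected.
params = urlencode({"q": q, "fl": "*,score", "wt": "json"})

# Hypothetical host and core name; adjust to your installation.
url = "http://localhost:8983/solr/mycore/select?" + params

print(q)
print(url)
```

The boost values 1.2 and 0.8 are the ones from the query above; they are a starting point to tune against your own data, not recommended constants.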

TF, IDF, and Solr's built-in relevancy ranking are somewhat arbitrary defaults; framing the query with boosts, Boolean queries, and context-based filters is where Solr's power and flexibility lie.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow