Penalize documents with a lot of unique terms in Lucene

https://stackoverflow.com/questions/13844867

07-12-2021

Question

My goal is to find animals (documents) given a city (term).

I've indexed documents this way:

doc1(bear)  = [city1, city2, city2, city3..]
doc2(dog)   = [city1, city1, city1, city2, city2, city2, city3, city3, city3..]
..

I'd like to penalize documents (animals) that appear in a lot of cities, that is, documents such as "dog" with a high ratio of distinct cities to all cities.

Any suggestions? Thanks

Solution

It already does!

See Similarity.computeNorm.

The norm function, by default, considers matches on shorter fields to be a more precise match, and so scores them higher than longer fields.

If you need this to have a heavier impact, you can override the DefaultSimilarity with a custom version, and modify the value returned from the computeNorm method to weigh the lengthNorm portion of the calculation more heavily. I'd recommend just adding a multiplier somewhere in the existing algorithm, if you need to do that, but tweak it however you need to.
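As a rough illustration of "weighing the lengthNorm portion more heavily", here is a minimal, Lucene-free sketch of the arithmetic such an override might perform. Note that a constant multiplier applied to the 3.6.0 formula quoted in this answer scales every document's norm by the same factor, so this sketch swaps in a tunable exponent instead; that is my substitution, not something the answer prescribes. The class name, the `exponent` parameter, and the term counts 4 and 9 are all hypothetical. In a real setup the same expression would live inside a computeNorm override on a DefaultSimilarity subclass.

```java
// Hypothetical stand-in for the arithmetic inside an overridden computeNorm.
// Plain Java only; it does not depend on Lucene.
class SteeperLengthNormSketch {

    // exponent = 0.5 reproduces boost * 1/sqrt(numTerms);
    // larger exponents penalize long fields more strongly.
    static float norm(float boost, int numTerms, double exponent) {
        return boost * (float) (1.0 / Math.pow(numTerms, exponent));
    }

    public static void main(String[] args) {
        int shortField = 4; // assumed term count of a short field
        int longField = 9;  // assumed term count of a long field
        for (double exponent : new double[] {0.5, 1.0}) {
            float ratio = norm(1.0f, shortField, exponent)
                        / norm(1.0f, longField, exponent);
            System.out.println("exponent=" + exponent
                    + " short/long norm ratio=" + ratio);
        }
    }
}
```

The ratio between the short and long field's norm grows with the exponent, which is the effect the question asks for. As with any change to the norm, the index has to be rebuilt before it shows up in scores.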

Note! As stated in the API, this value is stored in the index, not computed at query time. You must reindex to see changes take effect.

The calculation in computeNorm (3.6.0) is:

state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)))

where numTerms is the total number of terms in the field, and state is a FieldInvertState.
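To make the effect concrete, the following sketch evaluates that expression for the two example documents. It is plain Java with no Lucene dependency; the term counts (4 for "bear", 9 for "dog") are assumptions taken from the visible part of the truncated lists in the question, and the boost is assumed to be 1.0.

```java
// Worked example of boost * 1/sqrt(numTerms), using assumed term counts.
class DefaultLengthNormExample {

    // Same arithmetic as the 3.6.0 expression quoted above,
    // with boost and numTerms passed in directly.
    static float computeNorm(float boost, int numTerms) {
        return boost * ((float) (1.0 / Math.sqrt(numTerms)));
    }

    public static void main(String[] args) {
        float bear = computeNorm(1.0f, 4); // doc1: assumed 4 terms
        float dog = computeNorm(1.0f, 9);  // doc2: assumed 9 terms
        System.out.println("bear norm = " + bear);
        System.out.println("dog norm  = " + dog);
    }
}
```

The longer "dog" field gets the smaller norm, so a match on the same city scores lower for it than for "bear", all else being equal. Keep in mind that numTerms counts every term occurrence in the field, repeats included, so the penalty tracks total field length rather than the number of distinct cities.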

License: CC-BY-SA with attribution

Not affiliated with StackOverflow