Question

I understand that the default term frequency (tf) is simply calculated as the square root of the number of times the term being searched for appears in a field. So documents containing multiple occurrences of a term you are searching on will have a higher tf and hence a higher weight.

What I'm unsure about is whether this helps increase the document's score because the weight is higher, or reduces the document's score because it moves the document vector away from the query vector, as the book Hibernate Search in Action seems to be saying (pg 363). I confess I'm really struggling to see how the document vector model fits in with the Lucene scoring equation.

Solution

I don't have this book to check, but basically (if we ignore the different boosts that can be set manually at indexing time), there are three reasons why the score of some document may be higher (or lower) than the score of other documents with Lucene's default scoring model and for a given query:

  • the queried term has a low document frequency (boosting the IDF part of the score),
  • the queried term has a high number of occurrences in the document (boosting the TF part of the score),
  • the queried term appears in a rather small field of the document (boosting the norm part of the score).
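As a rough sketch, the three factors above can be written out for a single-term query. This assumes Lucene's classic TF-IDF similarity and leaves out boosts, the coordination factor and the query normalization, which are the same for every document matched by a given query. The symbol names (freq, numDocs, docFreq, numTerms) are chosen here for illustration:

```latex
\begin{aligned}
\mathrm{score}(T, D) &\propto \mathrm{tf}(T, D) \cdot \mathrm{idf}(T)^{2} \cdot \mathrm{norm}(D) \\
\mathrm{tf}(T, D)    &= \sqrt{\mathrm{freq}(T, D)} \\
\mathrm{idf}(T)      &= 1 + \ln\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}(T) + 1}\right) \\
\mathrm{norm}(D)     &= \frac{1}{\sqrt{\mathrm{numTerms}(D)}}
\end{aligned}
```

Every factor is positive, and tf grows with the number of occurrences, so more occurrences can only raise the product when the other two factors stay fixed.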

This means that for two documents D1 and D2 and one queried term T, if

  • T appears n times in D1,
  • T appears p > n times in D2,
  • the queried field of D2 has (almost) the same size (number of terms) as that of D1,

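D2 will have a better score than D1. In other words, all else being equal, a higher term frequency raises the document's score rather than lowering it.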
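To make the comparison concrete, here is a minimal Python sketch of that calculation. It is not Lucene code: it assumes the classic TF-IDF formulas (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), length norm = 1 / sqrt(numTerms)), ignores boosts, coord, queryNorm and the lossy one-byte encoding Lucene applies to stored norms, and all the numbers are made up for illustration.

```python
import math


def tf(freq):
    """Term-frequency factor: square root of the occurrence count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger factor."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Field-length factor: shorter fields get a larger factor."""
    return 1.0 / math.sqrt(num_terms)


def score(freq, num_terms, doc_freq, num_docs):
    """Simplified single-term score (boosts, coord and queryNorm omitted)."""
    return tf(freq) * idf(doc_freq, num_docs) ** 2 * length_norm(num_terms)


# Hypothetical index statistics: 1000 documents, 10 of which contain T.
NUM_DOCS = 1000
DOC_FREQ = 10

# D1: T appears n = 2 times in a 100-term field.
d1 = score(freq=2, num_terms=100, doc_freq=DOC_FREQ, num_docs=NUM_DOCS)
# D2: T appears p = 8 times in a field of the same size.
d2 = score(freq=8, num_terms=100, doc_freq=DOC_FREQ, num_docs=NUM_DOCS)
# Same 8 occurrences, but in a much longer 2000-term field.
d2_long = score(freq=8, num_terms=2000, doc_freq=DOC_FREQ, num_docs=NUM_DOCS)

print(d2 > d1)       # True: more occurrences, same field size
print(d2_long > d1)  # False: the length norm outweighs the extra occurrences
```

The last case shows why the "same field size" condition matters: a higher term frequency always pushes the score up, but a much longer field can pull it back down through the norm.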

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow