Why does Lucene's MoreLikeThis restrict its TermQueries to the field with the highest docFreq?

https://stackoverflow.com/questions/13500066

01-12-2021

Question

I'm currently working on a modified version of Lucene's MoreLikeThis to fit my own purposes. There is one thing I still can't understand. When creating the queue, MoreLikeThis looks, for each term, for the field in which that term has the highest docFreq.

// go through all the fields and find the largest document frequency
String topField = fieldNames[0];
int docFreq = 0;
for (int i = 0; i < fieldNames.length; i++) {
   int freq = ir.docFreq(new Term(fieldNames[i], word));
   topField = (freq > docFreq) ? fieldNames[i] : topField;
   docFreq = (freq > docFreq) ? freq : docFreq;
}

Only this field is then used in the TermQuery for that term, which can produce strange results.

For example, imagine you have two fields, "title" and "body", and two documents with exactly the same title. They still won't match each other, because every word of the title occurs in more documents' "body" field than "title" field, so the generated TermQueries all target "body". The same can happen the other way round. That seems pretty odd to me.

Another example: I use it in a system that filters results by user-dependent access permissions. There it happened that the user for whom the query was generated could not see the documents responsible for the high docFreq of the chosen field. The generated query found no documents, although there were plenty of documents the user could see that contained the exact terms, just in the wrong field.

I wonder why they don't just use all fields, or at least the fields in which the terms originally occur. Sure, it may be a performance issue. But I've implemented it to use all the fields in which the term occurs in the original document, plus the one with the highest docFreq. I tested it on an index with several thousand documents and could not see any difference (though I didn't run any benchmarks).
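To make the difference concrete, here is a minimal, library-free sketch of both selection strategies. It is not Lucene code: the `docFreq` function is a hypothetical stand-in for `ir.docFreq(new Term(field, word))`, `sourceDocTerms` is an assumed map from field name to the terms of the original document, and the class name, method names, and toy numbers are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.ToIntBiFunction;

final class FieldSelectionSketch {

    // Same logic as the loop quoted above: return the field in which `word`
    // has the largest document frequency (the first field wins ties and is
    // also returned when the word occurs nowhere).
    // `docFreq` is a hypothetical stand-in for ir.docFreq(new Term(field, word)).
    static String topField(String[] fieldNames, String word,
                           ToIntBiFunction<String, String> docFreq) {
        String topField = fieldNames[0];
        int best = 0;
        for (String field : fieldNames) {
            int freq = docFreq.applyAsInt(field, word);
            if (freq > best) {
                best = freq;
                topField = field;
            }
        }
        return topField;
    }

    // Variant described in the question: every field of the original document
    // that contains the word, plus the field with the highest docFreq.
    static Set<String> fieldsToQuery(String[] fieldNames, String word,
                                     Map<String, Set<String>> sourceDocTerms,
                                     ToIntBiFunction<String, String> docFreq) {
        Set<String> fields = new LinkedHashSet<>();
        for (String field : fieldNames) {
            if (sourceDocTerms.getOrDefault(field, Set.of()).contains(word)) {
                fields.add(field);
            }
        }
        fields.add(topField(fieldNames, word, docFreq));
        return fields;
    }

    public static void main(String[] args) {
        String[] fieldNames = {"title", "body"};
        // Toy index statistics: "lucene" occurs in 2 titles but in 50 bodies.
        Map<String, Integer> freqs = Map.of("title:lucene", 2, "body:lucene", 50);
        ToIntBiFunction<String, String> docFreq =
                (field, word) -> freqs.getOrDefault(field + ":" + word, 0);
        // The original document contains "lucene" only in its title.
        Map<String, Set<String>> sourceDoc = Map.of("title", Set.of("lucene"));

        System.out.println(topField(fieldNames, "lucene", docFreq));                 // body
        System.out.println(fieldsToQuery(fieldNames, "lucene", sourceDoc, docFreq)); // [title, body]
    }
}
```

With the toy numbers, the first strategy queries only "body" even though the term came from the title, while the variant queries both "title" and "body". In a real modification, each selected field would presumably get its own TermQuery; how those clauses are scored and combined is left open here.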

So, can anybody tell me why it's implemented this way? The only reason I can think of is performance on a really big index with lots of fields.

//EDIT: I implemented the first example to clarify the problem: http://pastebin.com/fwdENb3F

Solution

You should view MoreLikeThis as a reference implementation that doesn't fit all uses. If the implementation had targeted one field only, we would be seeing questions like: why does it search only the title field and completely miss that two book documents have the same author?
You can use setFieldNames to choose the fields by which similarity is determined.

Creating your own version of MoreLikeThis sounds like the best approach, especially given that you need to factor in ACLs.

Licensed under: CC-BY-SA with attribution