Question

I'm using Lucene.net 3.0.3 and I don't understand how stop words are supposed to work in queries.

I have this text as input: Stop the word

I used a StandardAnalyzer(Version.LUCENE_30) for both indexing and querying. The index has one field, Title.

The field is set to be stored and analyzed. I played with different TermVector, too. For query parsing I use the simple QueryParser.Parse and

This is the issue:

  • Query for Title:stop the word returns the doc. This is good.
  • Query for Title:stop word does not return the doc. I was expecting it to, since stop words are removed from the index.

Then I switched to LUCENE_24 and the second query actually returns the document.

I understood that stop words are removed from the index, and probably from the query, but it seems that I'm missing something basic.


Solution

The stop word is indeed removed during analysis, but as of Lucene 2.9 position increments are enabled by default. The removed word does not contribute to scores, yet it still leaves a gap in the token positions, so a phrase query expects "stop" and "word" to have one (removed) term between them. In Lucene 2.4 this functionality existed but was turned off by default. You can see this in the implementation of StopFilter.getEnablePositionIncrementsVersionDefault:

public static boolean getEnablePositionIncrementsVersionDefault(Version matchVersion) {
    return matchVersion.onOrAfter(Version.LUCENE_29);
}

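If you were to search for "stop into word", for instance, I expect you would see a hit with version 3.0: "into" is also a stop word, so it is dropped from the query but leaves the same one-position gap between "stop" and "word".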

The PositionIncrementAttribute documentation briefly gives the idea:

Set it to values greater than one to inhibit exact phrase matches. If, for example, one does not want phrases to match across removed stop words, then one could build a stop word filter that removes stop words and also sets the increment to the number of stop words removed before each non-stop word. Then exact phrase queries will only match when the terms occur with no intervening stop words.
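To make the position bookkeeping concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class PositionIncrementSketch, its methods analyze and phraseMatches, and the four-word stop list are all hypothetical names and values chosen for illustration. It only mimics the idea that a removed stop word can either leave a gap in the positions (the 2.9+ default) or not (the 2.4 default).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of stop-word removal with and without position increments.
// None of these names come from Lucene; they are illustrative only.
final class PositionIncrementSketch {

    // Tiny illustrative stop list (assumption); Lucene's default list is longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "into", "a", "of"));

    // A surviving term together with the position it occupies.
    static final class Token {
        final String term;
        final int position;
        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // Lowercases, splits on whitespace and drops stop words. With increments
    // enabled, every dropped word still advances the position counter.
    static List<Token> analyze(String text, boolean enablePositionIncrements) {
        List<Token> tokens = new ArrayList<>();
        int position = -1;
        int skipped = 0;
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (STOP_WORDS.contains(word)) {
                skipped++;
                continue;
            }
            position += enablePositionIncrements ? 1 + skipped : 1;
            skipped = 0;
            tokens.add(new Token(word, position));
        }
        return tokens;
    }

    static boolean hasTokenAt(List<Token> tokens, String term, int position) {
        for (Token t : tokens) {
            if (t.position == position && t.term.equals(term)) {
                return true;
            }
        }
        return false;
    }

    // Exact phrase match: every query term must appear in the document at the
    // same relative position as in the analyzed query.
    static boolean phraseMatches(String indexed, String query, boolean enablePositionIncrements) {
        List<Token> doc = analyze(indexed, enablePositionIncrements);
        List<Token> phrase = analyze(query, enablePositionIncrements);
        if (phrase.isEmpty()) {
            return false;
        }
        Token first = phrase.get(0);
        for (Token start : doc) {
            if (!start.term.equals(first.term)) {
                continue;
            }
            int offset = start.position - first.position;
            boolean allFound = true;
            for (Token q : phrase) {
                if (!hasTokenAt(doc, q.term, q.position + offset)) {
                    allFound = false;
                    break;
                }
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String indexed = "Stop the word";
        for (String query : new String[] {"stop the word", "stop word", "stop into word"}) {
            System.out.println(query
                    + " | increments on: " + phraseMatches(indexed, query, true)
                    + " | increments off: " + phraseMatches(indexed, query, false));
        }
    }
}
```

In this model, with increments on, "Stop the word" is indexed as stop at position 0 and word at position 2. The phrase "stop word" asks for adjacent positions and misses, while "stop the word" and "stop into word" both carry the same gap and match. With increments off, all three phrases collapse to adjacent terms and match, which corresponds to the behavior observed after switching to LUCENE_24.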

Licensed under: CC-BY-SA with attribution