Question

I am editing a Lucene.NET implementation (2.3.2) at work to include stemming and automatic wildcarding (appending * to the end of words).

I have found that wildcarding an exact word doesn't work (stack* matches stackoverflow, but stackoverflow* gets no hit). I was wondering what causes this and how it might be fixed.

Thanks in advance. (Also thanks for not asking why I am implementing both automatic wildcarding and stemming.)

I am about to make every query a prefix query so I don't have to append "*" to queries by hand, so we will see if anything becomes clear then.

Edit: Only words that get stemmed fail when wildcarded. For example, Silicate* doesn't work, but silic* does.


Solution

The reason it doesn't work is that you stem the content, which changes the Term stored in the index.

For example consider the word "valve". The snowball analyzer will stem it down to "valv".

So at search time, since you stem the input query, both "valve" and "valves" will be stemmed down to "valv". A TermQuery using the stemmed Term "valv" will match occurrences of both "valve" and "valves".

But since the index stores the Term "valv", a query for "valve*" will not match anything: "valv" does not start with "valve". That is because the QueryParser does not run the Analyzer on wildcard queries.
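The mismatch can be reproduced with a toy model. This is a minimal sketch in Python, not Lucene code: toy_stem is a hypothetical stand-in for the Snowball stemmer (it only strips a trailing "es", "s" or "e"), and the "index" is just a set of stemmed terms. Lowercasing the prefix is an assumption of the toy model.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, strip one suffix."""
    word = word.lower()
    for suffix in ("es", "s", "e"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(words):
    # Index time: every word goes through the stemmer before being stored.
    return {toy_stem(w) for w in words}


def term_query(index, text):
    # Plain query: the input is analyzed (stemmed) like the indexed content.
    return toy_stem(text) in index


def prefix_query(index, pattern):
    # Wildcard query: the prefix is NOT stemmed, only lowercased,
    # then compared against the stemmed terms in the index.
    prefix = pattern.rstrip("*").lower()
    return any(term.startswith(prefix) for term in index)


index = build_index(["valve", "valves"])  # the index holds only "valv"
print(term_query(index, "valves"))     # stemmed query finds "valv"
print(prefix_query(index, "val*"))     # "valv" starts with "val"
print(prefix_query(index, "valve*"))   # "valv" does not start with "valve"
```

The same thing happens with Silicate* versus silic* in the question: the shorter prefix is still a prefix of the stemmed term, while the full word is not.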

There is the AnalyzingQueryParser that can handle some of these cases, but I don't think it was in the 2.3.x versions of Lucene. Anyway, it's not a universal fit; the documentation says:

Warning: This class should only be used with analyzers that do not use stopwords or that add tokens. Also, several stemming analyzers are inappropriate: for example, GermanAnalyzer will turn Häuser into hau, but H?user will become h?user when using this parser and thus no match would be found (i.e. using this parser will be no improvement over QueryParser in such cases).

The solution mentioned in the duplicate I linked works for all cases, but you will get bigger indexes.
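The linked duplicate is not reproduced here, so the following is only an assumption about the kind of approach that trades index size for correctness: keep the unstemmed terms alongside the stemmed ones and send wildcard queries to the unstemmed set. It is a toy Python sketch, not Lucene code; toy_stem, build_fields and search are hypothetical names.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, strip one suffix."""
    word = word.lower()
    for suffix in ("es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_fields(words):
    # Two "fields": the original (lowercased) terms and their stemmed forms.
    exact = {w.lower() for w in words}
    stemmed = {toy_stem(w) for w in exact}
    return {"exact": exact, "stemmed": stemmed}


def search(fields, query):
    if query.endswith("*"):
        # Wildcard queries are not stemmed, so match them on unstemmed terms.
        prefix = query[:-1].lower()
        return any(term.startswith(prefix) for term in fields["exact"])
    # Plain queries are stemmed and matched on the stemmed terms.
    return toy_stem(query) in fields["stemmed"]


fields = build_fields(["valve", "valves", "silicate"])
print(search(fields, "valve*"))     # matches "valve" and "valves" in the exact set
print(search(fields, "Silicate*"))  # matches "silicate" in the exact set
print(search(fields, "valves"))     # stemmed to "valv", found in the stemmed set
print(len(fields["exact"]) + len(fields["stemmed"]))  # more terms than stemmed alone
```

The cost is visible in the last line: the term count is the sum of both sets, which is where the bigger index comes from.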

Licensed under: CC-BY-SA with attribution