Solr's SnowballPorterFilterFactory and Wildcard parameters

https://stackoverflow.com/questions/3317084

28-09-2020
|

Question

I'm having an issue querying Solr using the following field type:

<fieldType name="text_ci" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
       <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
   </analyzer>
</fieldType>

As you can see it applies the "SnowballPorterFilterFactory" when indexing and querying. If I Index something like

Mouse stuff and fun

It get's indexed as:

Index Breakdown in Solr

As you can see the word "Mouse" is turned into "Mous" by the "SnowballPorterFilterFactory". Which is what we want. However when we search for

Mouse*

It doesn't seem to apply the "SnowballPorterFilterFactory" in the same way. I guess due to the * at the end.

Query Breakdown in Solr

My question is.. Is there a way to make the "SnowballPorterFilterFactory" know about wildcards? So that when i Query for

Mouse*

I don't get 0 results.

Interestingly if i query for

mous*

The record does come back.

Or can someone offer a better way to query/index this type of field?

Thanks Dave

Solution

From the FAQ:

Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer, which is the component that performs operations such as stemming and lowercasing. The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*", which is not the intended query. These queries are case-insensitive anyway because QueryParser makes them lowercase. This behavior can be changed using the setLowercaseExpandedTerms(boolean) method

If you're fine with changing your Solr source, SOLR-757 has a patch attached to it which you might find useful. I don't know of a way to change this other than diving into the source though.

What might be a simpler idea: just have a copy field which is not stemmed. The user can search both of these fields, and then mouse* will match in the non-stemmed field.

(EDIT: actually, looking at that patch, I'm not sure it will do what you want. But basically you just need to change your query handler to stem first.)

OTHER TIPS

Last time I check, when you use wildcards, the query analyzer is not used. So since you are using a LowerCaseFilterFactory, your terms are indexed in lower case and searching for Mous* won't return anything.

I think the only thing to do when you are using wildcards is to make sure to adapt your query to the way your terms are indexed (in a way similar to what your query analyzer would do).

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow