Word does not get analysed properly using StemmerOverrideFilterFactory and SnowballPorterFilterFactory for Dutch language

StackOverflow https://stackoverflow.com/questions/22451774

Question

Solr: 3.5

Hi,

I created a Dutch field type with the following fieldType definition:

    <fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.StemmerOverrideFilterFactory" words="lang/stemdict_nl.txt"  ignoreCase="true"/>
            <filter class="solr.SnowballPorterFilterFactory" language="Kp" words="lang/stemdict_nl.txt"/>
        </analyzer>
    </fieldType>

stemdict_nl.txt contains 45710 word rules, built according to the Kraaij-Pohlmann algorithm (http://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html).

Most search queries work fine, and most of the suggestions I get are correct.

However, there is an issue when I search for 'etiketje'. My rules contain:

etiket                        etiket
etiketten                     etiket
etiketteren                   etiketteer
etikettering                  etiketteer
etiketje                      etiket

It should fall back to 'etiket', but instead it falls back to 'etik'. When I analyse my field, Solr returns:

etiketje
etiketje
etiketje
etiketje
etik

I would like Solr to analyse 'Etiketje' as:

etiketje
etiket

Hopefully someone here can point me in the right direction.


Solution

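Try changing your definition to match the syntax shown on the Solr wiki exactly. StemmerOverrideFilterFactory takes its mapping file through the dictionary attribute rather than words, and the override file should not be passed to SnowballPorterFilterFactory at all. That is, change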

    <filter class="solr.StemmerOverrideFilterFactory"
            words="lang/stemdict_nl.txt"  ignoreCase="true"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="Kp" words="lang/stemdict_nl.txt"/>

to

    <filter class="solr.StemmerOverrideFilterFactory"
            dictionary="lang/stemdict_nl.txt"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="Kp"/>

You do not need ignoreCase="true" on StemmerOverrideFilterFactory, because LowerCaseFilterFactory already runs before it and every token is lowercase by the time it reaches the override filter.
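
Putting this together with the rest of the definition from the question, the complete field type might look like the sketch below. It assumes the dictionary still lives at lang/stemdict_nl.txt and that its entries are tab-separated word/stem pairs, one per line, which is the format the override filter expects as far as I know. The rules in the question appear to be separated by spaces, so that is worth checking too.

```xml
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Overrides run before the stemmer; tokens matched here are marked as keywords,
             so the Snowball stemmer below leaves them alone -->
        <filter class="solr.StemmerOverrideFilterFactory" dictionary="lang/stemdict_nl.txt"/>
        <!-- Kraaij-Pohlmann stemmer for every token the dictionary did not cover -->
        <filter class="solr.SnowballPorterFilterFactory" language="Kp"/>
    </analyzer>
</fieldType>
```

After reloading the core and reindexing, the analysis page should show whether 'etiketje' now leaves the override stage as 'etiket'.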

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow