Stopwords not getting removed - solr

https://stackoverflow.com/questions/22583294

19-06-2023
|

Question

I am new to using Solr and have defined the following schema:

<schema name="example" version="1.5">
<fields>
    <field name="nodeId" type="string" indexed="true" stored="true" />
    <field name="_root_" type="string" indexed="true" stored="false" />
    <field name="datetime" type="string" indexed="true" stored="true"
        multiValued="true" />
    <field name="epochSecs" type="string" indexed="true" stored="true"
                    multiValued="true" />
    <field name="subject" type="text_general" indexed="true"
        stored="true" />
    <field name="body" type="text_general" indexed="true"
        stored="true" />
    <field name="emailId" type="string" indexed="true"
        stored="true" />
    <field name="compliantFlag" type="boolean" indexed="true"
                    stored="true" />
    <field name="_version_" type="long" indexed="true" stored="true" />
    <field name="text" type="text_general" indexed="true" stored="false"
        multiValued="true" />
    <field name="ngrams" type="myNGram" indexed="true" stored="false" required="false" />


</fields>
<uniqueKey>nodeId</uniqueKey>
<copyField source="datetime" dest="text" />
<copyField source="epochSecs" dest="text" />
<copyField source="subject" dest="text" />
<copyField source="body" dest="text" />
<copyField source="emailId" dest="text" />
<copyField source="compliantFlag" dest="text" />
<copyField source="text" dest="ngrams"/>

<types>
    <fieldType name="string" class="solr.StrField"
        sortMissingLast="true" omitNorms="true"/>
    <fieldType name="long" class="solr.TrieLongField"
                    precisionStep="0" positionIncrementGap="0" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="text_general" class="solr.TextField"
        positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
            <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
    </fieldType>
    <fieldType name="myNGram" stored="false" class="solr.TextField"> 
        <analyzer type="index"> 
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/> 
            <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="5"/> 
        </analyzer> 
    </fieldType>
</types>

The stopwords are not getting removed from the "body" field when indexed.

Also, how do i remove the special characters like \n from the below field using solr's analysers:

\n \n\n\nThese are the numbers Smurfit has.  \n\nP

Any help is appreciated. Thanks.

Solution

StandardTokenizer should create tokens around newlines, spaces, etc., and the stopword filter looks, at a glance, like it should be working correctly. You should probably include a LowercaseFilter above your StopwordFilter, to prevent those matches from being case sensitive, though.

I wonder if a pertinent question might be: What do you mean by "removed"? Analysis only affects the indexed representation of the field. It does not affect the stored version that you retrieve from the index in any way. It is meant to facilitate searching, not transform the stored version of the text. If you remove the word "the" through the filter, you should no longer get any hits on the word "the" while searching, but you will still see in when you retrieve the document from the index.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow