solr - skip suggestions that return same documents as original search

https://stackoverflow.com//questions/21027891

21-12-2019

Question

I have search suggestions working pretty well, and I like that I get suggestions even when the original keyword returned results (in case we have documents with misspellings in our collection). However, I often get suggestions that return exactly the same results as the original query. For example, when I search for yellow mint tin, I get "Did you mean yellow mint tins?"

Is there a way to remove suggestions that return the same results as the original term?

I'm using Solr 4.6.0. Here's the relevant configuration from solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_general</str>
  <!-- a spellchecker built from a field of the main index -->
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell2</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
    <str name="distanceMeasure">internal</str>
    <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
    <float name="accuracy">0.1</float>
    <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
    <int name="maxEdits">2</int>
    <!-- the minimum shared prefix when enumerating terms -->
    <int name="minPrefix">0</int> <!-- if set to 1, must start with same letter -->
    <!-- maximum number of inspections per result. -->
    <int name="maxInspections">5</int>
    <!-- minimum length of a query term to be considered for correction -->
    <int name="minQueryLength">4</int>
    <!-- maximum threshold of documents a query term can appear to be considered for correction -->
    <float name="maxQueryFrequency">0.01</float>
  </lst>
  <!-- a spellchecker that can break or combine words.  See "/spell" handler below for usage -->
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">spell2</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
    <str name="buildOnCommit">true</str>
    <int name="minBreakLength">3</int>
  </lst>
</searchComponent>

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="echoParams">none</str>
    <int name="rows">10</int>
    <str name="df">contents</str>
    <str name="defType">edismax</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">25</str>
    <str name="spellcheck.maxResultsForSuggest">25</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.collateParam.defType">dismax</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

Here's the info from schema.xml

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="spell2" type="text_general" indexed="true" stored="false" required="false" multiValued="true" />

An example query - http://localhost:8985/solr/(collection)/spell?q=yellow%20buttermints returns

<str name="collation">yellow (butter mints)</str>
<str name="collation">yellow buttermint</str>

"yellow buttermints" and "yellow buttermint" return the same results.

Solution

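I don't think there is a way to guarantee this, but the following should help.

Add EnglishMinimalStemFilterFactory to both the index-time and query-time analyzers. It reduces English plurals to their singular form, so terms such as "tin" and "tins" should end up as the same term in the spell2 field and no longer be offered as suggestions for each other.

https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-EnglishMinimalStemFilter

I am not sure how SynonymFilterFactory would interact with the stemmer in this case. You could try it without the synonym filter too.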
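
As a rough sketch of that change, here is the text_general field type from the question with the stemmer added to both analyzers. Placing it at the end of each chain, after LowerCaseFilterFactory, is my assumption; the answer does not say where in the chain it should go.

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- assumed placement: stem plurals after lowercasing -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- same filter at query time so query terms match the stemmed index terms -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because this changes the index-time analysis, the documents would need to be reindexed before spell2 contains the stemmed terms. Since queryAnalyzerFieldType in the spellcheck component is also text_general, the spellchecker's query analysis would pick up the stemmer as well.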

Licensed under: CC-BY-SA with attribution
