Solr to Tokenize on white space, comma and period

https://stackoverflow.com/questions/21760562

11-10-2022

Question

I am trying to force Solr to tokenize documents on whitespace, comma, colon (:) and semicolon (;), similar to what SQL Server Full-Text Search does. If I use the text_general field type, it also tokenizes on other characters such as '/', '\' and '-'. I tried using:

<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,:;\s*"/>

But it doesn't tokenize the text. Here is what my fieldType looks like:

<fieldType name="text_sqlserver" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,:;\s*"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,:;\s*"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Is there anything I am missing? I also need the search to be case-insensitive.

Solution

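Your pattern is wrong: \s*,:;\s* matches only the literal sequence ,:; (optionally surrounded by whitespace), not any single one of those characters. Use a character class instead: pattern="[\s,;:]"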
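As a sketch, here is the field type from the question with only the tokenizer pattern swapped for the character class. The filters are carried over unchanged, and it is assumed that stopwords.txt and synonyms.txt exist in the core's conf directory, as the question implies.

```xml
<fieldType name="text_sqlserver" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on any single whitespace character, comma, semicolon or colon -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;:]"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Lowercasing at index and query time gives case-insensitive matching -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same tokenizer as at index time so query terms line up with indexed terms -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;:]"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After changing the index-time analyzer, existing documents have to be reindexed for the new tokens to take effect. The Analysis screen in the Solr Admin UI is a convenient way to check which tokens a sample string produces.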

An alternative you might want to try is to combine:

PatternReplaceCharFilterFactory, to replace , : ; with whitespace before tokenization.
WhitespaceTokenizerFactory, which then tokenizes on whitespace only and offers some additional options.
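The alternative above could be sketched roughly as follows. The field type name text_sqlserver_ws is hypothetical, and the filter chain is simply copied from the question. The charFilter element has to come before the tokenizer inside each analyzer.

```xml
<!-- Hypothetical field type name; filters mirror the ones in the question -->
<fieldType name="text_sqlserver_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Turn each comma, semicolon or colon into a space before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,;:]" replacement=" "/>
    <!-- Then split on whitespace only, leaving '/', '\' and '-' inside tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,;:]" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

For this use case both approaches should yield the same tokens, so the choice is largely a matter of which configuration you find easier to read and maintain.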

Licensed under: CC-BY-SA with attribution
