Question

I am trying to force Solr to tokenize documents on whitespace, commas, colons (:), and semicolons (;), similar to what SQL Server Full-Text Search does. If I use the text_general field type, it also tokenizes on other characters such as '/', '\', and '-'. I tried using:

<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,:;\s*"/>

But it doesn't tokenize the text. Here is what my fieldType looks like:

<fieldType name="text_sqlserver" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,:;\s*"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,:;\s*"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Is there anything I am missing? I also need the search to be case-insensitive.

Solution

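Your pattern is wrong: \s*,:;\s* matches only the literal three-character sequence ,:; (with optional whitespace around it), so it never matches a lone space, comma, colon, or semicolon. Use a character class instead, which matches any one of the listed characters: pattern="[\s,;:]"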
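For reference, here is a sketch of how the field type from the question might look with the character-class pattern applied to both analyzers. Only the pattern attribute differs from the original definition. The trailing + is an assumption added here so that a run of delimiters (for example a comma followed by a space) is treated as a single separator; the plain [\s,;:] form suggested above should work as well.

```xml
<!-- Sketch only: same field type as in the question, with the tokenizer pattern replaced. -->
<fieldType name="text_sqlserver" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on any run of whitespace, commas, semicolons, or colons ("+" is an added assumption). -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;:]+"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- Lowercasing at index time and at query time gives case-insensitive matching. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;:]+"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the index-time analysis changes, existing documents would need to be reindexed for the new tokenization to take effect. The Analysis screen in the Solr admin UI is a convenient way to check how a sample string is split before reindexing.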

An alternative you might want to try :

Licensed under: CC-BY-SA with attribution