Sunspot/Solr: non-alphabetical characters

https://stackoverflow.com/questions/11438453

20-06-2021

Question

I'm using Solr with Sunspot/dismax. Is it possible to query for non-alphabetic characters, such as these?

~ ! @ # $ % ^ & * ( ) _ + - = [ ] { } | \

I'm aware that +/- must be escaped, as they are dismax inclusion/exclusion operators. But I'm getting no matches when I search for any of these characters:

    Foo.search { fulltext '=' }.results.length   # => 0
    Foo.search { fulltext '\=' }.results.length  # => 0

Yet:

    Foo.search { fulltext 'a' }.results.length   # => 30

Here is the tokenizer config I'm using:

    <fieldType name="text" class="solr.TextField" omitNorms="false">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StandardFilterFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

Solution

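Solr's StandardTokenizer discards most punctuation and symbol characters, since it is designed for plain prose. A term such as '=' is never found because it is stripped from the text during indexing. The same analyzer is applied to the query, so the character is stripped there too, and escaping it makes no difference.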

One tokenizer that preserves all characters is WhitespaceTokenizer, which splits input only on whitespace. You need to evaluate whether it is a good solution to your problem, because punctuation stays attached to the neighboring word and it will produce tokens like this:

20-year-old fox jumps over the lazy dog. -> '20-year-old', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.'
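
A minimal sketch of that change, assuming the `text` field type from the question with only the tokenizer swapped. StandardFilterFactory is left out here on the assumption that it is only useful after StandardTokenizer; treat this as a starting point rather than a tested configuration:

    <fieldType name="text" class="solr.TextField" omitNorms="false">
        <analyzer>
            <!-- Splits on whitespace only, so '=' and other symbols survive as tokens -->
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

Existing documents have to be reindexed before the new analysis takes effect (with sunspot_rails, typically `rake sunspot:reindex`).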

You may need to provide your own tokenizer (not necessarily by implementing one: you can define an appropriate regular expression for the split characters and use PatternTokenizer), or add a filter such as WordDelimiterFilter or PatternReplaceFilter.
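
As a rough illustration of the PatternTokenizer route, the sketch below splits on whitespace and a handful of sentence punctuation marks, so 'dog.' becomes 'dog' while symbols such as '=' are kept. The field type name `text_symbols` and the choice of split characters are assumptions made for this example, not part of the original setup:

    <fieldType name="text_symbols" class="solr.TextField" omitNorms="false">
        <analyzer>
            <!-- group="-1" makes the pattern act as a delimiter (split) rather than a token matcher -->
            <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s.,;:?]+" group="-1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

Adjust the character class to whatever should count as a separator in your data; anything not listed in it remains part of a token and is therefore searchable.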

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow