How can I set up Solr to tokenize on whitespace and punctuation?
28-09-2019
Question
I have been trying to get my Solr schema (using Solr 1.3.0) to produce terms that are tokenized on both whitespace and punctuation. Here are some examples of what I would like to see happen:
terms given -> terms tokenized
foo-bar -> foo,bar
one2three4 -> one2three4
multiple words/and some-punctuation -> multiple,words,and,some,punctuation
I thought that this combination would work:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
  </analyzer>
</fieldType>
The problem is that this configuration splits on letter-to-number transitions as well:
one2three4 -> one,2,three,4
I have tried various combinations of WordDelimiterFilterFactory settings, but none have worked. Is there a filter or tokenizer that can handle what I need?
Solution
How about:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0"/>
That should prevent one2three4 from being split.
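Putting it together, the full field type might look like the sketch below. This assumes `splitOnNumerics` is supported by your Solr version; the analyzer is declared once (without a `type` attribute) so the same tokenization applies at both index and query time:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace first; punctuation is handled by the filter below -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- generateWordParts="1": split on punctuation (foo-bar -> foo, bar)
         splitOnNumerics="0": keep letter/digit transitions intact (one2three4 stays whole) -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            splitOnNumerics="0"/>
  </analyzer>
</fieldType>
```

You can verify the resulting token stream for sample input in the Solr admin analysis page before reindexing.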
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow