How can I set up Solr to tokenize on whitespace and punctuation?

https://stackoverflow.com/questions/3891054

28-09-2019
Question

I have been trying to get my Solr schema (using Solr 1.3.0) to create terms that are tokenized on whitespace and punctuation. Here are some examples of what I would like to see happen:

terms given -> terms tokenized

foo-bar -> foo,bar
one2three4 -> one2three4
multiple words/and some-punctuation -> multiple,words,and,some,punctuation

I thought that this combination would work:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
  </analyzer>
</fieldType>

The problem is that this results in the following for letter to number transitions:

one2three4 -> one,2,three,4

I have tried various combinations of WordDelimiterFilterFactory settings, but none have proven useful. Is there a filter or tokenizer that can handle what I require?

Solution

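How about setting splitOnNumerics="0" on the filter:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0"/>

splitOnNumerics controls whether the filter splits at letter-to-number transitions, so turning it off should prevent one2three4 from being split.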
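Put together with the field type from the question, the full definition might look like the sketch below. It assumes the Solr version in use recognizes the splitOnNumerics attribute (worth checking against the release notes for an older release such as 1.3.0), and it uses a single analyzer for both indexing and querying, which is a simplification rather than something the question specifies.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: the same chain is applied at index and query time (assumption) -->
  <analyzer>
    <!-- First split the input on whitespace only -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Then split each token on punctuation such as - and /,
         but not at letter-to-number transitions -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            splitOnNumerics="0"/>
  </analyzer>
</fieldType>
```

The Analysis page in the Solr admin interface is a convenient way to check the result against inputs such as foo-bar and one2three4. Other attributes keep their defaults here, so behaviour such as splitting on case changes may still apply unless it is switched off explicitly.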

Licensed under: CC-BY-SA with attribution
