Tokenizer for KeepWordFilterFactory in Solr

https://stackoverflow.com/questions/14479473

17-01-2022

Question

I want to use Solr's KeepWordFilterFactory, but I can't find an appropriate tokenizer to go with it. My use case: given a string such as "hi i am coming, bla-bla go out", I want to keep only tokens such as "hi i", "coming", "bla-bla", etc. Which tokenizer should I use with the filter factory so that any such combination shows up in facets? I have tried different tokenizers but none gives the exact result. I am using Solr 4.0. Is there a tokenizer that tokenizes based on the keep words themselves?

Solution

What are your 'rules' for tokenization (that is, for splitting long text into individual tokens)? The example above seems to imply that sometimes you have single-word tokens and sometimes multi-word ones ("hi i"). The multi-word case is problematic here, but you might be able to handle it by adding ShingleFilterFactory, which gives you multi-word tokens as well as the original ones, and then keeping only the items you want with the keep-word filter.

I am not sure whether the KeepWord filter deals correctly with multi-word strings. If it does not, you may want to use a special separator character during the shingle step and then use a regex filter to turn it back into a space as the last step.
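
As a rough sketch of that chain (not a tested configuration), a field type along these lines might work. The field type name "text_keep", the file name "keepwords.txt", the "_" separator, and the choice of PatternTokenizerFactory splitting on whitespace and commas are all assumptions made for illustration; adjust them to your own tokenization rules.

```xml
<!-- Sketch only: "text_keep" and "keepwords.txt" are hypothetical names. -->
<fieldType name="text_keep" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace and commas so "bla-bla" survives as one token. -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,]+"/>
    <!-- Emit single words plus two-word shingles joined by "_" (assumed separator). -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
            outputUnigrams="true" tokenSeparator="_"/>
    <!-- Keep only the tokens listed in keepwords.txt, e.g. hi_i, coming, bla-bla. -->
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
    <!-- Turn the separator back into a space so facet values read "hi i". -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="_" replacement=" " replace="all"/>
  </analyzer>
</fieldType>
```

Here keepwords.txt would list one entry per line, with multi-word entries written using the separator (for example hi_i). Note that the final replace step also rewrites any "_" that was part of an original token, so pick a separator character that never occurs in your data.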

Licensed under: CC-BY-SA with attribution
