A token filter to accomplish this does indeed already exist! Take a look at EdgeNGramTokenFilter. An Analyzer
using it might look something like:
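    Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            // Treat the whole field value as a single token.
            KeywordTokenizer source = new KeywordTokenizer(reader);
            // Declared as TokenStream so the chain can be reassigned below.
            // Depending on the Lucene version, this constructor may also take a Version argument.
            TokenStream filter = new LowerCaseFilter(source);
            // Emit suffixes (Side.BACK) of 2 to 50 characters.
            filter = new EdgeNGramTokenFilter(filter, EdgeNGramTokenFilter.Side.BACK, 2, 50);
            return new TokenStreamComponents(source, filter);
        }
    };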
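To make it concrete what a chain like the one above would index, here is a minimal plain-Java sketch of the same idea: lowercase the whole input, then take every suffix between the minimum and maximum gram length. The class `BackEdgeNGramSketch` and the helper `backEdgeNGrams` are hypothetical names used for illustration only; this is not Lucene's implementation, and the order in which Lucene emits the grams may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BackEdgeNGramSketch {
    // Hypothetical helper, not part of Lucene: lowercases the whole input and
    // returns every suffix whose length is between minGram and maxGram,
    // shortest first.
    static List<String> backEdgeNGrams(String input, int minGram, int maxGram) {
        String term = input.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, term.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(term.substring(term.length() - len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints 5s, -5s, 0-5s, 90-5s, ... up to emg1090-5s, one per line.
        for (String gram : backEdgeNGrams("EMG1090-5S", 2, 50)) {
            System.out.println(gram);
        }
    }
}
```

Because every suffix becomes its own term, an "ends with" search turns into a plain term lookup, provided the query text is lowercased the same way and is not itself broken into n-grams.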
For your consideration, WordDelimiterFilter might also prove useful to you. It has a number of configuration options, and can be used to split tokens at punctuation and at transitions from letter to number, etc. So with it, from your input "EMG1090-5S" you could get the tokens:
- EMG
- 1090
- 5
- S
That might work well for your case, but it would not be particularly helpful in finding a fragment that crosses those boundaries, such as "MG1".
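The splitting behaviour described above can be approximated in plain Java to see why a query such as "MG1" finds nothing. The class `WordDelimiterSketch` and its `split` method are hypothetical and only mimic two of the rules (split at non-alphanumeric characters and at letter/digit transitions); the real filter has more options and this is not its implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class WordDelimiterSketch {
    // Hypothetical helper, not Lucene's implementation: ends the current token
    // at any non-alphanumeric character and at letter/digit transitions.
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : input.toCharArray()) {
            boolean alnum = Character.isLetterOrDigit(c);
            if (current.length() > 0) {
                char prev = current.charAt(current.length() - 1);
                boolean typeChange = Character.isDigit(prev) != Character.isDigit(c);
                if (!alnum || typeChange) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
            if (alnum) {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = split("EMG1090-5S");
        System.out.println(tokens);                 // [EMG, 1090, 5, S]
        System.out.println(tokens.contains("MG1")); // false
    }
}
```

"MG1" straddles the boundary between the tokens EMG and 1090, so it never appears as a term of its own; matching it would need n-grams or a wildcard query over the unsplit value.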