Question

I'm currently working on a custom analyzer for a Mahout clustering project. Since Mahout 0.8 updated Lucene to 4.3, I'm having trouble generating the tokenized-document file (a SequenceFile) from the book's now-outdated sample. The following code is my revision of the example code from the book Mahout in Action, but it gives me an IllegalStateException.

public class MyAnalyzer extends Analyzer {

private final Pattern alphabets = Pattern.compile("[a-z]+");
Version version = Version.LUCENE_43;

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(version, reader);
    TokenStream filter = new StandardFilter(version, source);

    filter = new LowerCaseFilter(version, filter);
    filter = new StopFilter(version, filter, StandardAnalyzer.STOP_WORDS_SET);

    CharTermAttribute termAtt = (CharTermAttribute)filter.addAttribute(CharTermAttribute.class);
    StringBuilder buf = new StringBuilder();

    try {

        filter.reset();
        while(filter.incrementToken()){
            if(termAtt.length()>10){
                continue;
            }
            String word = new String(termAtt.buffer(), 0, termAtt.length());
            Matcher matcher = alphabets.matcher(word);
            if(matcher.matches()){
                buf.append(word).append(" ");
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    source = new WhitespaceTokenizer(version, new StringReader(buf.toString()));

    return new TokenStreamComponents(source, filter);

}

}


Solution

Not quite sure why you get an IllegalStateException, but there are some likely possibilities. Typically your analyzer builds its filters on top of the tokenizer. You do that, but then you create another tokenizer and pass that one back instead, so the filter you return has no direct relation to the tokenizer you return. Also, the filter you have constructed is already at its end by the time it's passed back, so you might try resetting it, I suppose.
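
For reference, here is a minimal sketch of how the caller normally drives a Lucene 4.3 TokenStream once createComponents only builds the chain; the class name AnalyzerSmokeTest and the sample input are purely illustrative:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerSmokeTest {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new MyAnalyzer();
        TokenStream stream = analyzer.tokenStream("text", new StringReader("Some sample text 123"));
        CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
        // The caller drives the stream: reset, iterate, end, close.
        // Consuming the stream inside createComponents leaves it already exhausted when it is returned.
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(termAtt.toString());
        }
        stream.end();
        stream.close();
    }
}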

The main problem, though, is that createComponents isn't really a great place to implement parsing logic. It's where you set up the Tokenizer and the stack of Filters that will do that work. It would make more sense to implement your custom filtering logic in a Filter, extending TokenFilter (or TokenStream, AttributeSource, or some such).
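
For example, the per-token checks from your code could move into a small TokenFilter along these lines (AlphaOnlyFilter is an illustrative name, not an existing Lucene class); you would then add it to the chain in createComponents with filter = new AlphaOnlyFilter(filter):

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Keeps only tokens of ten characters or fewer that consist entirely of lower-case letters
public final class AlphaOnlyFilter extends TokenFilter {

    private static final Pattern ALPHABETS = Pattern.compile("[a-z]+");
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public AlphaOnlyFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Pull tokens from the upstream stream and skip any that fail the checks
        while (input.incrementToken()) {
            if (termAtt.length() <= 10 && ALPHABETS.matcher(termAtt).matches()) {
                return true;
            }
        }
        return false;
    }
}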

I think what you are looking for has already been implemented, though, in PatternReplaceCharFilter. Note that it is a CharFilter, so it wraps the Reader and runs before the tokenizer rather than sitting in the token-filter chain:

private final Pattern nonAlpha = Pattern.compile("[^a-zA-Z]+");

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Strip runs of non-alphabetic characters from the raw input before tokenizing
    Tokenizer source = new StandardTokenizer(version, new PatternReplaceCharFilter(nonAlpha, " ", reader));
    TokenStream filter = new StandardFilter(version, source);
    filter = new LowerCaseFilter(version, filter);
    filter = new StopFilter(version, filter, StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(source, filter);
}

or perhaps something still simpler, like this, would serve:

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // LowerCaseTokenizer already splits on non-letter characters and lower-cases the tokens
    Tokenizer source = new LowerCaseTokenizer(version, reader);
    TokenStream filter = new StopFilter(version, source, StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(source, filter);
}
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow