Lucene Custom Analyzer for indexing and query

Question

The Analysis package documentation explains how to use a CharFilter. You wrap the reader with it in your overridden initReader method.

I'm assuming the problem with your WordDelimiterFilter is that you don't know how to set the configuration options you are using. You construct an int to pass into the constructor by combining the appropriate constants with a bitwise OR (|), such as:

int config = WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.GENERATE_WORD_PARTS; //etc.
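
To see why the operator matters when building a flag int like this, here is a minimal stand-alone sketch. The constants in it are hypothetical stand-ins rather than the real WordDelimiterFilter fields, and it assumes each flag occupies its own bit, as flag constants conventionally do. OR accumulates the flags, while AND of distinct single-bit flags collapses to 0 and so enables nothing.

```java
public class FlagDemo {
    // Hypothetical stand-ins for flag constants; each is assumed to occupy its own bit.
    static final int GENERATE_WORD_PARTS = 1;
    static final int GENERATE_NUMBER_PARTS = 2;
    static final int PRESERVE_ORIGINAL = 32;

    // Bitwise OR keeps every flag that is set in any operand.
    static int combinedWithOr() {
        return GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | PRESERVE_ORIGINAL;
    }

    // Bitwise AND keeps only bits common to all operands: none, for distinct flags.
    static int combinedWithAnd() {
        return GENERATE_WORD_PARTS & GENERATE_NUMBER_PARTS & PRESERVE_ORIGINAL;
    }

    // A flag is enabled when its bit survives an AND against the config.
    static boolean isSet(int config, int flag) {
        return (config & flag) != 0;
    }

    public static void main(String[] args) {
        int config = combinedWithOr();
        System.out.println(isSet(config, GENERATE_NUMBER_PARTS)); // true
        System.out.println(combinedWithAnd());                    // 0
    }
}
```

Testing a flag with isSet is the one place where & belongs: it is how a consumer of the int reads an option back out.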

So, in the end you might end up with something like:

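import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

//StopwordAnalyzerBase grants you some convenient ways to handle stop word sets.
public class MyAnalyzer extends StopwordAnalyzerBase {

    //Must be static: instance fields can't be referenced in the super(...) call
    private static final Version version = Version.LUCENE_47;
    private final int wordDelimiterConfig;

    public MyAnalyzer() throws IOException {
        super(version, loadStopwordSet(new FileReader("stopwords.txt"), version));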
        //Might as well load this config up front, along with the stop words.
        //Flags are combined with bitwise OR; AND-ing them would yield 0.
        wordDelimiterConfig = 
            WordDelimiterFilter.GENERATE_WORD_PARTS |
            WordDelimiterFilter.GENERATE_NUMBER_PARTS |
            WordDelimiterFilter.CATENATE_WORDS |
            WordDelimiterFilter.CATENATE_NUMBERS |
            WordDelimiterFilter.PRESERVE_ORIGINAL;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(version, reader);
        TokenStream filter = new WordDelimiterFilter(source, wordDelimiterConfig, null);
        filter = new LowerCaseFilter(version, filter);
        filter = new StopFilter(version, filter, stopwords);
        filter = new PorterStemFilter(filter);
        return new TokenStreamComponents(source, filter);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }
}

Note: I've moved the StopFilter to after the LowerCaseFilter. This makes stop word matching case insensitive, as long as your stop word definitions are all in lowercase. I don't know whether this is problematic because of the WordDelimiterFilter. If so, there is a loadStopwordSet method that supports case insensitivity, but I, frankly, don't know how to use it.
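
To illustrate the ordering point, here is a rough sketch using plain Java collections rather than Lucene classes. The stop word list and the two helper methods are hypothetical and only mimic the effect of running a stop filter before or after lowercasing. It is not how Lucene implements either filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopOrderDemo {
    // Hypothetical stop word list, defined entirely in lowercase.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "and"));

    // Stop filtering first: "The" does not match "the", so it slips through.
    static List<String> stopThenLowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Lowercasing first: every token is compared in lowercase, so "The" is removed.
    static List<String> lowercaseThenStop(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String lower = t.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Quick", "and", "Brown");
        System.out.println(stopThenLowercase(tokens)); // [the, quick, brown]
        System.out.println(lowercaseThenStop(tokens)); // [quick, brown]
    }
}
```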