Question

I am working with Lucene 4.7 and trying to migrate one of the analyzers we use in our Solr configuration.

 <analyzer> 
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>  
    <filter class="solr.WordDelimiterFilterFactory" 
            generateWordParts="1" 
            generateNumberParts="1" 
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            preserveOriginal="1"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>

However, I cannot figure out how to use the HTMLStripCharFilterFactory and the WordDelimiterFilterFactory with the configuration above. Also, my query analyzer in Solr is as follows; how can I achieve the same in Lucene?

 <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>

Solution

The Analysis package documentation explains how to use a CharFilter. You wrap the reader with it in your overridden initReader method.

I'm assuming the problem with your WordDelimiterFilter is that you don't know how to set the configuration options you are using. You construct an int to pass into the constructor by combining the appropriate constants with a bitwise OR (|), such as:

int config = WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.GENERATE_WORD_PARTS; //etc.
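
To see why the operator matters when building a flag int like this, here is a minimal, Lucene-free sketch. The class name FlagSketch and the constant values are stand-ins chosen for illustration; the only assumption is that each option occupies its own bit. OR-ing such flags accumulates them, while AND-ing distinct flags leaves nothing set.

```java
public class FlagSketch {
    // Hypothetical stand-in constants: assumed to be distinct single bits.
    // These are illustrative values, not read from the Lucene source.
    static final int GENERATE_WORD_PARTS = 1;
    static final int GENERATE_NUMBER_PARTS = 1 << 1;
    static final int CATENATE_WORDS = 1 << 2;
    static final int CATENATE_NUMBERS = 1 << 3;
    static final int PRESERVE_ORIGINAL = 1 << 5;

    // Bitwise OR keeps every bit that is set in any operand.
    static int combineWithOr() {
        return GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | CATENATE_WORDS
                | CATENATE_NUMBERS | PRESERVE_ORIGINAL;
    }

    // Bitwise AND keeps only bits set in every operand; distinct flags share none.
    static int combineWithAnd() {
        return GENERATE_WORD_PARTS & GENERATE_NUMBER_PARTS & CATENATE_WORDS
                & CATENATE_NUMBERS & PRESERVE_ORIGINAL;
    }

    // A flag is enabled when its bit survives masking against the config.
    static boolean isSet(int config, int flag) {
        return (config & flag) != 0;
    }

    public static void main(String[] args) {
        System.out.println("OR:  " + Integer.toBinaryString(combineWithOr()));
        System.out.println("AND: " + Integer.toBinaryString(combineWithAnd()));
        System.out.println("preserveOriginal enabled: "
                + isSet(combineWithOr(), PRESERVE_ORIGINAL));
    }
}
```

With an AND-ed config every option would be switched off, so the filter would behave as if none of the Solr attributes had been set to 1.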

So, in the end you might end up with something like:

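import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

//StopwordAnalyzerBase grants you some convenient ways to handle stop word sets.
public class MyAnalyzer extends StopwordAnalyzerBase {

    //Static, because instance fields cannot be referenced in the call to super(...)
    private static final Version VERSION = Version.LUCENE_47;
    private final int wordDelimiterConfig;

    public MyAnalyzer() throws IOException {
        //loadStopwordSet reads one stop word per line from the given reader
        super(VERSION, loadStopwordSet(new FileReader("stopwords.txt"), VERSION));
        //Might as well load this config up front, along with the stop words.
        //Flags are combined with bitwise OR; options set to "0" in Solr are simply left out.
        wordDelimiterConfig =
            WordDelimiterFilter.GENERATE_WORD_PARTS |
            WordDelimiterFilter.GENERATE_NUMBER_PARTS |
            WordDelimiterFilter.CATENATE_WORDS |
            WordDelimiterFilter.CATENATE_NUMBERS |
            WordDelimiterFilter.PRESERVE_ORIGINAL;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(VERSION, reader);
        //null means no protected words
        TokenStream filter = new WordDelimiterFilter(source, wordDelimiterConfig, null);
        filter = new LowerCaseFilter(VERSION, filter);
        //stopwords is the set handed to the StopwordAnalyzerBase constructor
        filter = new StopFilter(VERSION, filter, stopwords);
        filter = new PorterStemFilter(filter);
        return new TokenStreamComponents(source, filter);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        //Strip HTML markup before the text reaches the tokenizer
        return new HTMLStripCharFilter(reader);
    }
}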

Note: I've moved the StopFilter to after the LowerCaseFilter. This makes stop word matching case-insensitive, as long as your stop word definitions are all in lowercase. I don't know whether this is problematic given the WordDelimiterFilter, which now runs before stop word removal rather than after it as in your Solr configuration. If so, there is a loadStopwordSet method that supports case insensitivity, but I, frankly, don't know how to use it.

Licensed under: CC-BY-SA with attribution