Question

I need to turn off stemming in the EnglishAnalyzer and other similar analyzers (such as the ItalianAnalyzer, etc.).
I'm using Lucene 3.6.2, and I saw that it is only possible to specify a set of words that should not be stemmed, using this constructor: EnglishAnalyzer documentation - stemExclusionSet

How can I do this?

Solution

Usually when you use language-specific analysis, it's because you want stemming. StandardAnalyzer is a quite effective non-language-specific analyzer if you don't want stemming.

There are, however, some other nice little details that get handled in the language analyzers. If you really need to eliminate just the stemmer, grab the source of the analyzer and create your own analyzer that overrides the createComponents method (the one that returns a TokenStreamComponents), leaving out the stem filter and its associated components. You'll usually find a SetKeywordMarkerFilter, which can also be removed, since it is only used to prevent stemming on selected tokens. For example:

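import java.io.Reader;
import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.it.ItalianAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.ElisionFilter;
import org.apache.lucene.util.Version;

// Reuse the stopword set that ItalianAnalyzer loads by default.
final CharArraySet defaultStopwords = new ItalianAnalyzer(Version.LUCENE_47).getStopwordSet();

// Elided articles, copied from the ItalianAnalyzer source (case-insensitive set).
final CharArraySet defaultArticles = CharArraySet.unmodifiableSet(
   new CharArraySet(Version.LUCENE_47, 
       Arrays.asList(
      "c", "l", "all", "dall", "dell", "nell", "sull", "coll", "pell", 
       "gl", "agl", "dagl", "degl", "negl", "sugl", "un", "m", "t", "s", "v", "d"
       ), true));

// Same chain as ItalianAnalyzer, minus SetKeywordMarkerFilter and the stemmer.
Analyzer customItalianAnalyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
    TokenStream result = new StandardFilter(Version.LUCENE_47, source);
    result = new ElisionFilter(result, defaultArticles);
    result = new LowerCaseFilter(Version.LUCENE_47, result);
    result = new StopFilter(Version.LUCENE_47, result, defaultStopwords);
    return new TokenStreamComponents(source, result);
  }
};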

Note that I've reproduced the stopword and elision set definitions here. I've also removed a version check, since in your own analyzer you can specify a single version rather than handling it in an if statement (I've assumed you are using a version after 3.2).

Another option would be to just copy the entire contents of the ItalianAnalyzer and delete the stemming lines, but I think it's healthy to give it a once-over like this and get a cursory understanding of the tokenizer/filter chain, so you can make intelligent decisions about what you really want your analyzer to do.
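
To get a feel for what each stage of that chain does to the text, here is a rough plain-Java imitation of it. This is not Lucene code: the class name FilterChainSketch and the analyze method are made up for illustration, the regex split is only a crude stand-in for StandardTokenizer, and the five-word stopword list is a hypothetical sample (Lucene's bundled Italian list is much longer). Only the article list is taken from the snippet above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of the filter chain above. NOT Lucene code.
class FilterChainSketch {
    // Elided articles, same entries as defaultArticles in the answer.
    static final Set<String> ARTICLES = new HashSet<>(Arrays.asList(
        "c", "l", "all", "dall", "dell", "nell", "sull", "coll", "pell",
        "gl", "agl", "dagl", "degl", "negl", "sugl", "un", "m", "t", "s", "v", "d"));

    // Hypothetical sample; stands in for the real Italian stopword set.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "il", "la", "di", "e", "che"));

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        // Tokenizer stand-in: split on runs of anything that is not
        // a letter, a digit, or an apostrophe.
        for (String token : text.split("[^\\p{L}\\p{N}']+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            // Elision step: strip a leading article plus its apostrophe.
            int apos = token.indexOf('\'');
            if (apos > 0) {
                String prefix = token.substring(0, apos).toLowerCase(Locale.ROOT);
                if (ARTICLES.contains(prefix)) {
                    token = token.substring(apos + 1);
                }
            }
            // Lowercase step.
            token = token.toLowerCase(Locale.ROOT);
            // Stop step: drop stopwords. There is no stemming step at all.
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [albero, amico, casa]
        System.out.println(analyze("L'albero dell'amico e la casa"));
    }
}
```

Because nothing in the chain stems, a plural such as "amici" comes out unchanged instead of being reduced to a stem, which is the behaviour the question asks for.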

Licensed under: CC-BY-SA with attribution