Question

I am using Lucene 4.4 and I have a project to do. In that project all non-letters must be removed and all upper-case letters must be converted to lower-case. I know that there is an analyzer for removing non-letters.

But is there an analyzer in Lucene that both removes all non-letters and converts all upper-case letters to lower-case?

Cheers.

Solution

Actually, yes, there is an analyzer that does that: SimpleAnalyzer.


The following does (almost) exactly the same thing:

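import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new LetterTokenizer(Version.LUCENE_44, reader);   // split on non-letters
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_44, source); // lower-case each token
    return new TokenStreamComponents(source, filter);
  }
};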

When you have very specific requirements for an Analyzer, you'll often need to design your own by chaining a Tokenizer and some Filters like this, as shown in the Analyzer documentation. LetterTokenizer defines a token as a maximal string of adjacent letters, and LowerCaseFilter does what it says on the tin.
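To make the tokenization rule concrete, here is a minimal plain-Java sketch of "maximal runs of adjacent letters, then lower-cased". It does not use Lucene at all: the class name LetterLowerCaseSketch and the method tokenize are hypothetical names chosen for illustration, and the use of Character.isLetter as the letter test is an assumption about how the rule would typically be expressed, not a copy of Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Lucene code.
class LetterLowerCaseSketch {

    // Splits text into maximal runs of letters and lower-cases each run.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                // Still inside a run of letters: extend the current token.
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                // A non-letter ends the current token; the non-letter itself is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        // Flush the last token if the text ended on a letter.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, World! It's 2013")); // prints [hello, world, it, s]
    }
}
```

Note that digits and punctuation act purely as separators under this rule, so "It's" becomes the two tokens "it" and "s". If that is not what your project needs, a different tokenizer is the better starting point.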

This is a fairly common combination, so there is also LowerCaseTokenizer, which does the job of both LetterTokenizer and LowerCaseFilter in one step and thus provides a performance advantage. LowerCaseTokenizer is what SimpleAnalyzer actually uses.

Licensed under: CC-BY-SA with attribution