Question

I just wrote a custom CharTokenizer, and I want to use it in my Solr server.

In Solr 3, I could just extend TokenizerFactory and return my CharTokenizer from the create method, but that TokenizerFactory does not exist in Solr 4.
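
This is roughly what my Solr 3 factory looked like (written from memory, so the exact base class, org.apache.solr.analysis.BaseTokenizerFactory, may not be spot on):

import java.io.Reader;

import org.apache.solr.analysis.BaseTokenizerFactory;

public class MyCustomTokenizerFactory extends BaseTokenizerFactory {

  @Override
  public MyCustomTokenizer create(Reader input) {
    // luceneMatchVersion is filled in from the factory configuration
    assureMatchVersion();
    return new MyCustomTokenizer(luceneMatchVersion, input);
  }
}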

So, I noticed that I should replace TokenizerFactory with TokenFilterFactory, but in that case I cannot return my custom CharTokenizer, because the parameters don't match: create() receives a TokenStream instead of a Reader.

I also searched for documentation, but it looks like there is nothing really useful about this out there.

So, how can I make it work?

Example:

import java.io.Reader;

import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.Version;

public class MyCustomTokenizer extends CharTokenizer {

  // ASCII 24 (CAN) is treated as an extra separator character
  char anotherSpace = 24;

  public MyCustomTokenizer(Version matchVersion, Reader in) {
    super(matchVersion, in);
  }

  @Override
  protected boolean isTokenChar(int c) {
    return !Character.isWhitespace(c) && isToken((char) c);
  }

  private boolean isToken(char c) {
    if (c == anotherSpace || c == ',') {
      return false;
    }
    return true;
  }
}

import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class MyCustomTokenizerFactory extends TokenFilterFactory {

  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    assureMatchVersion();
  }

  @Override
  public TokenStream create(TokenStream input) {
    // this is where it breaks: MyCustomTokenizer expects a Reader,
    // but a TokenFilterFactory's create() only receives a TokenStream
    return new MyCustomTokenizer(luceneMatchVersion, input);
  }
}

Thanks in advance.

Solution

The best way to see how to implement this is to look at the source code of an existing Tokenizer and its factory in Lucene.

Example:

WhitespaceTokenizer
WhitespaceTokenizerFactory
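
Following that pattern, the factory in Lucene/Solr 4 extends org.apache.lucene.analysis.util.TokenizerFactory (not TokenFilterFactory) and overrides create(Reader), which may return a Tokenizer directly. Here is a minimal sketch for the tokenizer from the question, assuming a 4.0-4.3 style API where the factory still exposes the luceneMatchVersion field and assureMatchVersion() (the factory API changed again in 4.4):

import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;

public class MyCustomTokenizerFactory extends TokenizerFactory {

  @Override
  public Tokenizer create(Reader input) {
    // fails fast if luceneMatchVersion was not configured for this factory
    assureMatchVersion();
    return new MyCustomTokenizer(luceneMatchVersion, input);
  }
}

The field type in schema.xml then references this factory on its <tokenizer class="..."/> element inside the analyzer; the package prefix in that class attribute is whatever package you put the factory in.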

Licensed under: CC-BY-SA with attribution