Question

This could be considered a general Java question, but for better understanding I'm using Lucene as an example.

You can use different Tokenizers in Lucene to tokenize text. There's the main abstract Tokenizer class and then many different classes that extend it. The same goes for TokenFilter.

Now, it seems that each time you want to index a document, a new Tokenizer is created. The question is: since a Tokenizer is just a utility class, why not make it static? For example, a Tokenizer that converts all letters to lower case could have a static method that does just that for every input it gets. What's the point of creating a new object for every piece of text we want to index?
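To make the question concrete, something like this hypothetical static helper is what I have in mind (the class and method names are made up, not actual Lucene API):

```java
// Hypothetical static utility -- NOT part of Lucene's API, just what the question imagines.
public final class LowerCaseUtil {
    private LowerCaseUtil() {}

    // Lower-case the whole input in one stateless call, no object creation per document.
    public static String tokenizeToLowerCase(String input) {
        return input.toLowerCase(java.util.Locale.ROOT);
    }
}
```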

One thing to mention - Tokenizer has a private field that contains the input it receives to tokenize. I just don't see why we need to store it this way, because the object is destroyed right after the tokenization process is over and the newly tokenized text is returned. The only thing I can think of is maybe multi-threaded access?

Thank you!


Solution

Now, it seems that each time you want to index a document, a new Tokenizer is created.

This is not true: the Analyzer.reusableTokenStream method is called, which reuses not just the Tokenizer, but the entire chain (TokenFilters, etc.). See http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/analysis/Analyzer.html#reusableTokenStream(java.lang.String, java.io.Reader)
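As an illustration, here is a minimal sketch of how an analyzer built against the Lucene 3.0-era API can cache and reuse its chain. The analyzer class itself is hypothetical; the Analyzer methods it overrides and calls (tokenStream, reusableTokenStream, getPreviousTokenStream, setPreviousTokenStream, Tokenizer.reset(Reader)) are the real 3.0-era ones:

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Hypothetical analyzer showing the chain-reuse pattern from the Lucene 3.0-era API.
public class ReusableLowerCaseAnalyzer extends Analyzer {

    // Holds the already-built chain so it can be reused for the next document.
    private static final class SavedStreams {
        Tokenizer source;
        TokenStream result;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Non-reusing path: build a fresh chain every time.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        SavedStreams streams = (SavedStreams) getPreviousTokenStream();
        if (streams == null) {
            // First use on this thread: build the chain once and remember it.
            streams = new SavedStreams();
            streams.source = new WhitespaceTokenizer(reader);
            streams.result = new LowerCaseFilter(streams.source);
            setPreviousTokenStream(streams);
        } else {
            // Subsequent documents: just point the existing tokenizer at the new input.
            streams.source.reset(reader);
        }
        return streams.result;
    }
}
```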

One thing to mention - Tokenizer has a private field that contains the input it receives to tokenize. I just don't see why we need to store it this way, because the object is destroyed right after the tokenization process is over and the newly tokenized text is returned. The only thing I can think of is maybe multi-threaded access?

As mentioned earlier, the entire chain of Tokenizers and TokenFilters is reused across documents, so all of their Attributes are reused too. It is also important to note that Attributes are shared across the chain (e.g. the Attribute references of all Tokenizers and TokenFilters point to the same instances). This is why it is crucial to call clearAttributes() in your tokenizer to reset all attributes.
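As a rough sketch (not an actual Lucene class), a custom tokenizer written against the 3.0-era attribute API could look like this; note the clearAttributes() call at the top of incrementToken(), which wipes whatever the previous token (or previous document) left in the shared attributes:

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical whitespace tokenizer illustrating why clearAttributes() must be called per token.
public final class SimpleWhitespaceTokenizer extends Tokenizer {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    public SimpleWhitespaceTokenizer(Reader input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes(); // reset state left over from the previous token / previous document

        // Skip leading whitespace.
        int c = input.read();
        while (c != -1 && Character.isWhitespace(c)) {
            c = input.read();
        }
        if (c == -1) {
            return false; // no more tokens in this document
        }

        // Collect characters up to the next whitespace.
        StringBuilder sb = new StringBuilder();
        while (c != -1 && !Character.isWhitespace(c)) {
            sb.append((char) c);
            c = input.read();
        }
        termAtt.setTermBuffer(sb.toString());
        return true;
    }
}
```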

As an example, a Whitespace tokenizer adds a reference to a TermAttribute in its constructor, and it is wrapped by a LowerCaseFilter, which also adds a reference to a TermAttribute in its constructor. Both of these TermAttributes point to the same underlying char[]. When a new document is processed, Analyzer.reusableTokenStream is invoked, which returns the same TokenStream chain (in this case Whitespace wrapped with LowerCaseFilter) used for the previous document. The reset(Reader) method is called, resetting the tokenizer's input to the new document's contents. Finally, reset() is called on the entire stream, which resets any internal state from the previous document, and the contents are processed until incrementToken() returns false.
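Putting it together, the consuming side works roughly like this (the field name and analyzer variable are assumptions; the TokenStream calls are the 3.0-era API):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Sketch of consuming a reused chain; 'analyzer' and the field name "body" are illustrative.
public class ConsumeChainExample {
    public static void printTokens(Analyzer analyzer, String text) throws Exception {
        // Returns the same tokenizer + filter chain as for the previous document,
        // with the tokenizer's Reader already pointed at the new contents.
        TokenStream stream = analyzer.reusableTokenStream("body", new StringReader(text));
        stream.reset(); // clears per-document state across the whole chain

        // Every stage in the chain shares this same TermAttribute instance.
        TermAttribute term = stream.getAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(term.term());
        }
        stream.end();
    }
}
```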

OTHER TIPS

Don't worry about creating an instance here and there of a class when doing something as complex as indexing a document with Lucene. There are going to be lots and lots of objects created inside the tokenizing and indexing process. One more tokenizer instance is literally nothing compared to the leftover garbage from thrown-away objects when the process completes. If you don't believe me, run a profiler and watch the object creation counts.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow