Question

I've created a custom trigram analyzer for fuzzy matching in my project, using NGramTokenizer(Version.LUCENE_44, reader, 3, 3), i.e. specifying a minimum and maximum token size of 3.

At index time I am getting the expected trigram tokens, but when I use the same analyzer at query time (via QueryParser), it skips tokens shorter than 3 characters.

Example

Indexed Document - Hi Rushik

Indexed trigrams - hi_, i_r, rus, ush, shi, hik (checked using the Luke index reader)

Query - Hi Rushik AB XYZ.

Parsed query (QueryParser result) - (name_data:rus name_data:ush name_data:shi name_data:hik) name_data:xyz

As you can see, the query parser removed terms shorter than 3 characters. I understand that I specified 3,3 when tokenizing, but in that case shouldn't indexing also have skipped anything shorter than 3 characters?

I think I am missing something here. Any help?


Solution

Got the answer.

Lucene's QueryParser first splits the query string on whitespace and only then runs the analyzer on each individual term. Since my analyzer is NGram(3,3), it cannot produce any token from a term of fewer than 3 characters, so those terms simply disappear from the parsed query. At index time, by contrast, the whole field value is analyzed as a single stream, which is why trigrams there can even span word boundaries (hi_, i_r).
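The effect can be sketched without Lucene at all. Below, NGramDemo and trigrams() are illustrative names (not Lucene APIs); trigrams() is a plain sliding window standing in for NGramTokenizer(3, 3), and the main method mimics QueryParser's whitespace-first behavior:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {

    // Emit every 3-character window of the input, roughly what an
    // NGramTokenizer configured with min=3, max=3 would produce.
    static List<String> trigrams(String text) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            grams.add(text.substring(i, i + 3));
        }
        return grams; // a term shorter than 3 chars yields no grams at all
    }

    public static void main(String[] args) {
        // Index time: the whole field value is one stream, so grams cross the space.
        System.out.println("index: " + trigrams("hi rushik"));

        // Query time: the parser splits on whitespace FIRST, then analyzes each term.
        for (String term : "hi rushik ab xyz".split("\\s+")) {
            System.out.println(term + " -> " + trigrams(term));
        }
        // "hi" and "ab" produce zero trigrams, so they vanish from the parsed query.
    }
}
```

Running this shows "hi" and "ab" mapping to empty gram lists, which is exactly why they are missing from the parsed query, while the indexed field still contains grams for them as part of the longer stream.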

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow