Lucene analyzer for substrings

https://stackoverflow.com/questions/17419072

02-06-2022

Question

I have a database table with about 40,000 records containing code fields, such as:
FLEFSU25B-25M
EMG1090-5S

I need to be able to very quickly select all codes that contain a given substring. For example "109" matches EMG1090-5S.

My current approach is to store the codes in Lucene and have Lucene filter by substring, such as 109. But that is not very efficient if I just store the codes, because then Lucene has to search through all the tokens.

To overcome this, I'm thinking of creating a new analyzer that will split each code into tokens, like this:
EMG1090-5S
MG1090-5S
G1090-5S
1090-5S
...

Then to find all codes with substring 109, I can search on 109* which is much more efficient (I understand Lucene stores tokens alphabetically, just like SQL Server indexes).
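The idea can be sketched in plain Java without Lucene, just to check the reasoning. `SuffixIndex` and its methods are hypothetical names made up for this illustration. A `TreeMap` stands in for the sorted term dictionary, and the range bound assumes codes never contain the character `\uffff`.

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
class SuffixIndex {
    // Sorted map from each suffix to the codes it was taken from.
    private final TreeMap<String, Set<String>> suffixes = new TreeMap<>();

    // Index every suffix of the code: EMG1090-5S, MG1090-5S, G1090-5S, ...
    void add(String code) {
        for (int i = 0; i < code.length(); i++) {
            suffixes.computeIfAbsent(code.substring(i), k -> new TreeSet<>()).add(code);
        }
    }

    // A substring match is a prefix match on some suffix, so a single
    // sorted range lookup replaces a scan over every token.
    Set<String> containing(String substring) {
        Set<String> result = new TreeSet<>();
        // Keys starting with `substring` fall in [substring, substring + '\uffff').
        String upper = substring + Character.MAX_VALUE;
        for (Set<String> codes : suffixes.subMap(substring, true, upper, false).values()) {
            result.addAll(codes);
        }
        return result;
    }
}
```

After adding both sample codes, `containing("109")` returns only EMG1090-5S. The cost is index size: a code of length n contributes n suffix entries.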

Does this make sense? Does such an analyzer already exist? I'm using .Net/C#.

Solution

A token filter to accomplish this does indeed already exist! Take a look at EdgeNGramTokenFilter. An Analyzer using it might look something like:

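import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Keep the whole code as one token and lowercase it. Constructor signatures
    // vary between Lucene releases (some also take a Version argument).
    KeywordTokenizer source = new KeywordTokenizer(reader);
    TokenStream filter = new LowerCaseFilter(source);
    filter = new EdgeNGramTokenFilter(filter, EdgeNGramTokenFilter.Side.BACK, 2, 50);
    return new TokenStreamComponents(source, filter);
  }
};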
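To see what terms such a chain would index, here is a rough plain-Java approximation of lowercasing followed by back-edge n-grams of length 2 to 50. `BackEdgeNGrams` is a hypothetical helper written for this illustration, not a Lucene class, and it does not model token positions, offsets, or the order in which the real filter emits terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper approximating the analyzer's output; not Lucene code.
class BackEdgeNGrams {
    // Returns the lowercased trailing substrings of `term` whose lengths
    // run from minGram up to maxGram (capped at the term length).
    static List<String> of(String term, int minGram, int maxGram) {
        String lower = term.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, lower.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(lower.substring(lower.length() - len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints the nine suffixes from "5s" up to "emg1090-5s".
        System.out.println(of("EMG1090-5S", 2, 50));
    }
}
```

One of the emitted terms is "1090-5s", so a prefix query for 109* finds the code. Two consequences of these settings are worth noting: the indexed terms are lowercase, so the query text should be lowercased as well, and with a minimum gram size of 2 the final character on its own is never indexed.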

For your consideration, WordDelimiterFilter might also prove useful to you. It has a number of configuration options, and can be used to split at punctuation, at transitions from letter to number, etc. With it, from your input "EMG1090-5S" you could get the tokens:

EMG
1090
5
S

This might work well for your case, but it would not be particularly helpful in finding something like "MG1", which spans the boundary between two tokens.
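The limitation can be checked with a simplified stand-in for the splitting described above. `DelimiterSplit` is a hypothetical class written for this illustration. It breaks only at non-alphanumeric characters and at letter/digit transitions, and ignores the other options the real filter offers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified splitter; not the Lucene implementation.
class DelimiterSplit {
    static List<String> split(String code) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : code.toCharArray()) {
            boolean alnum = Character.isLetterOrDigit(c);
            // A letter following a digit (or the reverse) starts a new part.
            boolean typeChange = alnum && current.length() > 0
                && Character.isDigit(c) != Character.isDigit(current.charAt(current.length() - 1));
            if ((!alnum || typeChange) && current.length() > 0) {
                parts.add(current.toString());
                current.setLength(0);
            }
            if (alnum) {
                current.append(c); // punctuation is dropped, not kept as a part
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }
}
```

For "EMG1090-5S" this yields EMG, 1090, 5 and S. No part begins with "MG1", so a prefix search on these tokens cannot find that substring.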

Licensed under: CC-BY-SA with attribution
