Using Lucene, there are a couple of options I would recommend.
One would be to index the product IDs with a KeywordAnalyzer, and then query, as you suggested, with a FuzzyQuery.
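A minimal sketch of that first option might look like the following. The field name "productId" and the sample IDs are just assumptions for illustration; the key points are that KeywordAnalyzer keeps the whole ID as a single token, and that FuzzyQuery then tolerates a small edit distance at query time.

```java
// Assumed imports: org.apache.lucene.analysis.core.KeywordAnalyzer,
// org.apache.lucene.document.*, org.apache.lucene.index.*,
// org.apache.lucene.search.*, org.apache.lucene.store.*, org.apache.lucene.util.Version

Directory dir = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, new KeywordAnalyzer());
IndexWriter writer = new IndexWriter(dir, config);

// StringField is indexed as a single, unanalyzed token, so the ID stays intact
Document doc = new Document();
doc.add(new StringField("productId", "ABC-I5-2500K", Field.Store.YES));
writer.addDocument(doc);
writer.close();

IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);

// A near-miss ("M" instead of "K") is within FuzzyQuery's default edit distance
Query query = new FuzzyQuery(new Term("productId", "ABC-I5-2500M"));
TopDocs hits = searcher.search(query, 10);
```

Note that this matches whole IDs that are a few edits apart; it won't help with case or delimiter differences inside the ID, which is what the second option addresses.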
Or, you could create a custom Analyzer that adds a WordDelimiterFilter, which will create tokens at changes in case, as well as at dashes and spaces (if any remain in your tokens after they have passed through the tokenizer). An important note: if you are using a StandardAnalyzer, SimpleAnalyzer, or something similar, make sure the WordDelimiterFilter is applied BEFORE the LowerCaseFilter. Running the stream through the LowerCaseFilter first would, of course, prevent it from splitting terms on camel casing. One more caution: you'll probably want to customize your StopFilter, since "I" is a common English stopword.
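As a sketch of that stopword customization, you can build your own CharArraySet rather than using the defaults; the word list below is only illustrative (the important part is that "i" is deliberately left out, so a token like "i5" survives):

```java
// Assumed imports: java.util.Arrays, org.apache.lucene.analysis.util.CharArraySet,
// org.apache.lucene.util.Version

// A custom stopword set that omits "i"; note that CharArraySet does not
// support remove(), so build the set you want rather than trimming a default one
CharArraySet myStopWords = new CharArraySet(Version.LUCENE_40,
        Arrays.asList("a", "an", "and", "of", "the"), true);
```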
In a custom Analyzer, you mainly need to override createComponents(). For example, to add WordDelimiterFilter functionality to the StandardAnalyzer's chain of filters:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_40, reader);
    TokenStream filter = new StandardFilter(Version.LUCENE_40, tokenizer);
    // Take a look at the WordDelimiterFilterFactory API for other options on this filter's behavior
    filter = new WordDelimiterFilter(filter, WordDelimiterFilter.GENERATE_WORD_PARTS, null);
    filter = new LowerCaseFilter(Version.LUCENE_40, filter);
    // As mentioned, build a CharArraySet of your own stopwords, since the defaults will likely cause problems for you
    filter = new StopFilter(Version.LUCENE_40, filter, myStopWords);
    return new TokenStreamComponents(tokenizer, filter);
}
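For completeness, a minimal Analyzer subclass wrapping a createComponents() like that might look as follows; the class name and the idea of passing the stopword set in through the constructor are my own assumptions, not anything Lucene requires:

```java
// Assumed imports: java.io.Reader, org.apache.lucene.analysis.*,
// org.apache.lucene.analysis.core.*, org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter,
// org.apache.lucene.analysis.standard.*, org.apache.lucene.analysis.util.CharArraySet,
// org.apache.lucene.util.Version

public final class ProductIdAnalyzer extends Analyzer {

    private final CharArraySet myStopWords;

    public ProductIdAnalyzer(CharArraySet stopWords) {
        this.myStopWords = stopWords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_40, reader);
        TokenStream filter = new StandardFilter(Version.LUCENE_40, tokenizer);
        // Split on case changes, dashes, etc. BEFORE lowercasing
        filter = new WordDelimiterFilter(filter, WordDelimiterFilter.GENERATE_WORD_PARTS, null);
        filter = new LowerCaseFilter(Version.LUCENE_40, filter);
        filter = new StopFilter(Version.LUCENE_40, filter, myStopWords);
        return new TokenStreamComponents(tokenizer, filter);
    }
}
```

You would then pass an instance of this analyzer to your IndexWriterConfig, and use the same analyzer at query time so indexing and searching tokenize the IDs identically.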