Question

Today's challenge is building a search engine for my store's product database.

A lot of products are created by hand, by a lot of different hands!

So you're likely to find "i-phone 3gs", "iPhone4" and "i phone 5".

What I want is to search for "iPhone" and find all three example products above.

That reminded me of "fuzzy searches". I tried to use them out of the box, without success.

What do I have to index, and how should I search, to handle this kind of example (special characters or whitespace inside a document body) and retrieve these "synonym" results?

e.g.

iPhone => "i-phone"

"special 40" => "special40"

Was it helpful?

Solution

Using Lucene, there are a couple of options I would recommend.

One would be to index product ids with a KeywordAnalyzer, and then query as you suggested, with a fuzzy query.
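To see why a fuzzy query can work for this: Lucene's FuzzyQuery matches terms within a bounded Levenshtein edit distance (2 by default). Here is a quick, Lucene-free sketch of that distance computation; the class and method names are illustrative, not Lucene API:

```java
public class EditDistanceDemo {
    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "iPhone" vs "i-phone": one inserted dash plus one case change = 2 edits,
        // right at FuzzyQuery's default limit.
        System.out.println(levenshtein("iPhone", "i-phone"));   // 2
        // "special40" vs "special 40": a single inserted space = 1 edit.
        System.out.println(levenshtein("special40", "special 40")); // 1
        // "iPhone" vs "i phone 5": 4 edits, too far for a default fuzzy match.
        System.out.println(levenshtein("iPhone", "i phone 5")); // 4
    }
}
```

So fuzzy matching a whole keyword-analyzed product id works for near variants like "i-phone", but longer strings such as "i phone 5" drift outside the edit-distance budget, which is why the analyzer-based approach below is usually the more robust option.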

Or, you could create a custom Analyzer that includes a WordDelimiterFilter, which creates tokens at case changes as well as at dashes and spaces (if any survive the tokenizer). An important note: if you are using a StandardAnalyzer, SimpleAnalyzer, or something similar, make sure the WordDelimiterFilter is applied BEFORE the LowerCaseFilter. Running the stream through the LowerCaseFilter first would, of course, prevent it from splitting terms on camel casing. One more caution: you'll probably want to customize your StopFilter, since "i" is a common English stopword.

In a custom analyzer, you mainly just need to override createComponents(). For example, if you wanted to add WordDelimiterFilter functionality into the StandardAnalyzer's set of filters:

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_40, reader);
    TokenStream filter = new StandardFilter(Version.LUCENE_40, tokenizer);
    // Take a look at the WordDelimiterFilter API for other flags controlling this filter's behavior
    filter = new WordDelimiterFilter(filter, WordDelimiterFilter.GENERATE_WORD_PARTS, null);
    filter = new LowerCaseFilter(Version.LUCENE_40, filter);
    // As mentioned, build a CharArraySet of your own stopwords, since the default set will likely cause problems for you
    filter = new StopFilter(Version.LUCENE_40, filter, myStopWords);
    return new TokenStreamComponents(tokenizer, filter);
}

Other tips

With Solr, please make sure to walk through the example tutorial and the corresponding schema.xml. You will see two field type definitions there (en_splitting and en_splitting_tight, I think) that cover very similar use cases.

Specifically, you are looking at WordDelimiterFilter combined with LowerCaseFilter and possibly SynonymFilter. You do have to be a bit careful with SynonymFilter, though, especially if you are mapping to/from multi-word equivalents.
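For reference, a minimal Solr fieldType along those lines might look like the following. This is a sketch, not a copy of the shipped schema: the name "text_splitting" is illustrative, and the exact flags you want depend on your data, so compare against the en_splitting definitions in the example schema.xml:

```xml
<fieldType name="text_splitting" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split on case changes, hyphens and letter/digit boundaries,
         so "iPhone4" yields "iPhone" and "4"; catenate* also indexes
         the joined forms ("iphone4"). -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            splitOnCaseChange="1"/>
    <!-- Lowercase AFTER word-delimiting, so case changes are still visible. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a field of this type, both "i-phone 3gs" and "iPhone4" index a lowercase "iphone" token, so a query for "iPhone" matches all the variants in the question.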

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow