Question

I have an application that uses Lucene.NET, and I am having trouble using Lucene's synonym feature when the search phrase or term contains multiple words.

For example, if I search for the word "superman" and have set up the synonym "spiderman", I expect to get back (and do get back) the results related to "spiderman" as well as "superman".

Now what I want is to search for "Justice League" and have a synonym set up for that term, such as "The Avengers", and also, say, "Superman" with the synonym "Justice League".

You get where I am going with this. In summary, I want the ability to set up multi-word (phrase) synonyms. I am aware that synonyms are normally one word to one word, but is there a custom approach, with Lucene.NET or Lucene in general, that people use to get around this problem? I heard Lucene was adding this feature, but I haven't found anything useful so far while looking around.

Thanks, Ed


Solution

Look at solr.SynonymFilterFactory

Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (e.g. "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:

  1. The Lucene QueryParser tokenizes on whitespace before giving any text to the Analyzer, so if a person searches for the words sea biscit, the Analyzer will be given the words "sea" and "biscit" separately and will not know that together they match a synonym.
  2. Phrase searching (e.g. "sea biscit") will cause the QueryParser to pass the entire string to the Analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example, the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)", which would not match the simple case of "seabiscuit" occurring in a document.
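
The two issues above can be sketched with a small model in plain Java. It does not use Lucene or Lucene.NET at all: ordinary collections stand in for token positions, and the class name, method names and sample documents are all made up for illustration. Treat it as a rough picture of the behaviour described, under those assumptions, not as how the QueryParser or SynonymFilter is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative model only: plain collections stand in for Lucene token positions.
// Every name here is hypothetical; none of this is Lucene API.
class SynonymPositionsSketch {

    // The synonym group from the example in the answer.
    static final List<String> GROUP =
            Arrays.asList("sea biscuit", "sea biscit", "seabiscuit");

    // Issue 1: the query is split on whitespace first, so each word is looked
    // up on its own and a multi-word synonym entry is never seen as a whole.
    static boolean matchesSynonymAfterWhitespaceSplit(String query) {
        for (String word : query.split(" ")) {
            if (GROUP.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Issue 2: expanded synonyms are flattened into one set of terms per
    // position, because only "these terms share a position" can be expressed.
    // The set drops the duplicate "sea" that the answer's notation shows twice.
    static List<Set<String>> flatten(List<String> synonyms) {
        List<Set<String>> positions = new ArrayList<>();
        for (String synonym : synonyms) {
            String[] words = synonym.split(" ");
            for (int i = 0; i < words.length; i++) {
                if (positions.size() <= i) {
                    positions.add(new LinkedHashSet<>());
                }
                positions.get(i).add(words[i]);
            }
        }
        return positions;
    }

    // A phrase-style match: every position needs one of its allowed terms,
    // at consecutive word offsets in the document.
    static boolean phraseMatches(List<Set<String>> positions, String document) {
        String[] docWords = document.split(" ");
        for (int start = 0; start + positions.size() <= docWords.length; start++) {
            boolean allPositionsMatch = true;
            for (int i = 0; i < positions.size(); i++) {
                if (!positions.get(i).contains(docWords[start + i])) {
                    allPositionsMatch = false;
                    break;
                }
            }
            if (allPositionsMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Unquoted query: "sea" and "biscit" are checked separately.
        System.out.println(matchesSynonymAfterWhitespaceSplit("sea biscit"));

        // Quoted query: the expansion collapses to two positions.
        List<Set<String>> positions = flatten(GROUP);
        System.out.println(positions);

        // A document using the two-word form versus the one-word form.
        System.out.println(phraseMatches(positions, "sea biscuit won the race"));
        System.out.println(phraseMatches(positions, "seabiscuit won the race"));
    }
}
```

In this model the one-word document fails because "won" is not an allowed term for the second position, which mirrors why the expanded phrase query misses a document that contains only "seabiscuit". Expanding synonyms at index time avoids both problems, because every variant is already present in the index when the query arrives.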
Licensed under: CC-BY-SA with attribution