Question

Looking for advice on handling ampersands and the word "and" in Lucene queries. My test queries are (including quotes):

  • "oil and gas field" (complete phrase)
  • "research and development" (complete phrase)
  • "r&d" (complete phrase)

Ideally, I'd like to use the QueryParser as the input is coming from the user.

During testing and doc reading, I found that using the StandardAnalyzer doesn't work for what I want. For the first two queries, a QueryParser.Parse converts them to:

contents:"oil gas field"
contents:"research development"

Which isn't what I want. If I use a PhraseQuery instead, I get no results (presumably because "and" isn't indexed).

If I use a SimpleAnalyzer, then I can find the phrases but QueryParser.Parse converts the last term to:

contents:"r d"

Which again, isn't quite what I'm looking for.

Any advice?

Solution

If you want to search for "and", you have to index it. Write your own Analyzer or remove "and" from the list of stop words. The same applies to "r&d": write your own Analyzer that creates three tokens from the text: "r", "d", and "r&d".
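As a rough illustration of the token stream such an Analyzer would need to produce, here is a minimal sketch in plain Python. It does not use any Lucene API; the `tokenize` function and its regular expression are hypothetical and only show the intended output: every word is kept (including "and"), and an ampersand-joined term is emitted as its parts plus the whole term.

```python
import re

# Hypothetical tokenizer, not a Lucene API. It only illustrates the tokens
# a custom Analyzer would have to emit for the queries in the question.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:&[a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text, keep every word (no stop-word removal), and expand
    ampersand-joined terms such as "r&d" into their parts plus the whole term."""
    tokens = []
    for match in TOKEN_RE.finditer(text.lower()):
        term = match.group()
        if "&" in term:
            tokens.extend(term.split("&"))  # the parts, e.g. "r" and "d"
        tokens.append(term)                 # the whole term, or a plain word
    return tokens

print(tokenize("Oil and gas field"))
print(tokenize("R&D budget"))
```

The same tokenization would have to be applied to both the indexed text and the query text, otherwise the terms will not line up.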

OTHER TIPS

Step one of working with Lucene is to accept that pretty much all of the work is done at the time of indexing. If you want to search for something then you index it. If you want to ignore something then you don't index it. It is this that allows Lucene to provide such high speed searching.

The upshot of this is that for an index to work effectively, you have to anticipate up front what your analyzer needs to do. In this case I would write my own analyzer that doesn't strip any stop words and also transforms & to 'and' (and optionally @ to 'at', etc.). For r&d to match research & development, you are almost certainly going to have to implement some domain-specific logic.
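A minimal sketch of that normalization, again in plain Python rather than against the Lucene API. The `ABBREVIATIONS` table stands in for the domain-specific logic and is purely an assumption; a real one would come from your own vocabulary. The `normalize` function is likewise hypothetical.

```python
import re

# Hypothetical domain-specific expansions (assumption, not from any library).
ABBREVIATIONS = {"r&d": "research and development"}
# Symbols spelled out as words, as suggested above.
SYMBOL_WORDS = {"&": "and", "@": "at"}

def normalize(text):
    """Return the tokens to index (or to query with), keeping all stop words."""
    text = text.lower()
    # Expand known abbreviations first, so "r&d" does not become "r and d".
    # The lookarounds stop the match from firing inside a longer word.
    for short, full in ABBREVIATIONS.items():
        pattern = r"(?<![a-z0-9])" + re.escape(short) + r"(?![a-z0-9])"
        text = re.sub(pattern, full, text)
    # Spell out the remaining symbols as separate words.
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, " " + word + " ")
    # Split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text)

print(normalize("R&D"))
print(normalize("Research & Development"))
```

With this in place, "r&d", "research & development", and "research and development" all reduce to the same three tokens, so a phrase query on any of them would match the others, provided the index was built with the same function.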

There are other ways of dealing with this. If you can differentiate between phrase searches and normal keyword searches then there is no reason you can't maintain two or more indexes to handle different types of search. This gives very quick searching but will require some more maintenance.

Another option is to use the high speed of Lucene to filter your initial results down to something more manageable using an analyzer that doesn't give false negatives. You can then run some detailed filtering over the full text of those documents that it does find to match the correct phrases.
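The two-stage idea can be sketched as follows. This is plain Python, not Lucene: the first stage stands in for a Lucene query run with a stop-word-stripping analyzer, and the second stage checks the exact phrase against the full text of the surviving documents. All function names and the stop-word list are hypothetical.

```python
import re

STOP_WORDS = {"and", "the", "of"}  # illustrative subset only

def coarse_terms(text):
    """Stage-one view of a text: a set of terms with stop words stripped."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def exact_words(text):
    """Stage-two view: every word in order, with '&' kept inside terms."""
    return re.findall(r"[a-z0-9&]+", text.lower())

def contains_phrase(words, phrase):
    """True if the word list `phrase` occurs contiguously in `words`."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))

def search(docs, phrase_text):
    # Stage 1: cheap candidate filter that may over-match but never drops
    # a document that really contains the phrase.
    wanted = coarse_terms(phrase_text)
    candidates = [d for d in docs if wanted <= coarse_terms(d)]
    # Stage 2: exact phrase check on the full text of the candidates.
    phrase = exact_words(phrase_text)
    return [d for d in candidates if contains_phrase(exact_words(d), phrase)]

docs = [
    "The oil and gas field is large",
    "Gas field near an oil refinery",
    "Oil or gas field",
]
print(search(docs, "oil and gas field"))
```

All three documents survive the first stage, but only the first one contains the exact phrase, so it is the only result after the second stage.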

Ultimately, I think you are going to find that Lucene sacrifices accuracy in more advanced searches in order to provide speed; it is generally good enough for most people. You are probably in uncharted waters trying to tweak your analyzer this much.

Licensed under: CC-BY-SA with attribution