Make Lucene treat all terms in a field as a single term

https://stackoverflow.com/questions/606576

lucene

03-07-2019

Question

In my Lucene documents I have a field "company" where the company name is tokenized. I need the tokenization for a certain part of my application. But for this query, I need to be able to create a PrefixQuery over the whole company field.

Example:

My Brand
- my
- brand
brahmin farm
- brahmin
- farm

A regular prefix query for "bra" returns both documents, because each of them has a term starting with "bra".
What I want is a query that returns only the last entry, because only there does the first term start with "bra".

Any suggestions?

Solution

Use a SpanQuery to search only the first term position: a PrefixQuery wrapped in a SpanMultiTermQueryWrapper, which is in turn wrapped in a SpanPositionRangeQuery:

<SpanPositionRangeQuery: spanPosRange(SpanMultiTermQueryWrapper(company:bra*), 0, 1)>
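In Java, the nested query printed above could be built roughly as follows. This is a minimal sketch, assuming a Lucene version in which the span classes live in org.apache.lucene.search.spans (in Lucene 9 and later they are in org.apache.lucene.queries.spans, so the imports would need adjusting). The class name FirstTermPrefix and its build method are hypothetical helpers, not part of Lucene. The prefix is not analyzed, so it should be given in the same form as the indexed terms (for example, lowercased).

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.search.spans.SpanPositionRangeQuery;
import org.apache.lucene.search.spans.SpanQuery;

// Hypothetical helper class, named for illustration only.
final class FirstTermPrefix {

    // Builds a query matching documents whose FIRST token in `field`
    // starts with `prefix`.
    static SpanQuery build(String field, String prefix) {
        // Plain prefix query: matches a term starting with `prefix` at any position.
        PrefixQuery prefixQuery = new PrefixQuery(new Term(field, prefix));

        // Wrap it so it can take part in span (position-aware) queries.
        SpanQuery spanPrefix = new SpanMultiTermQueryWrapper<PrefixQuery>(prefixQuery);

        // Accept only matches lying inside positions [0, 1), i.e. the first token.
        return new SpanPositionRangeQuery(spanPrefix, 0, 1);
    }
}
```

Calling FirstTermPrefix.build("company", "bra") gives a query that can be passed to IndexSearcher.search like any other Query. With the example documents, "brahmin farm" should match and "My Brand" should not, since "brand" sits at position 1.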

OTHER TIPS

Create another indexed field, where the company name is not tokenized. When necessary, search on that field rather than the tokenized company name field.
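A minimal sketch of this two-field approach, assuming Lucene's StringField (indexed as a single, unanalyzed term) for the extra field. The field name company_exact and the class CompanyFields are hypothetical. Because an unanalyzed field is case-sensitive, the sketch lowercases both the indexed value and the prefix; that normalization is an assumption and should mirror whatever the analyzer of the tokenized field does.

```java
import java.util.Locale;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

// Hypothetical helper class, named for illustration only.
final class CompanyFields {
    static final String TOKENIZED = "company";   // existing analyzed field
    static final String EXACT = "company_exact"; // hypothetical extra field

    // Indexes the company name twice: analyzed, and as one whole term.
    static Document toDocument(String companyName) {
        Document doc = new Document();
        doc.add(new TextField(TOKENIZED, companyName, Field.Store.YES));
        // StringField is not tokenized: the whole value becomes a single term.
        String whole = companyName.trim().toLowerCase(Locale.ROOT);
        doc.add(new StringField(EXACT, whole, Field.Store.NO));
        return doc;
    }

    // Prefix query over the whole company name rather than individual tokens.
    static Query startsWith(String prefix) {
        return new PrefixQuery(new Term(EXACT, prefix.toLowerCase(Locale.ROOT)));
    }
}
```

Existing queries keep using the company field; only the "starts with" search goes through CompanyFields.startsWith.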

If you want fast searches, you need index entries that point directly at the records of interest. There might be something you can do with the proximity data to filter records, but it will be slow. I see the problem as: how can a "contains" query over a complete field be performed efficiently?

You might be able to minimize the increase in index size by creating (for each current field) a "first term" field and "remaining terms" field. This would eliminate duplication of the first term in two fields. For "normal" queries, you look for query terms in either of these fields. For "startswith" queries, you search only the "first term" field. But this seems like more trouble than it's worth.
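The split described above can be illustrated without Lucene. The sketch below is plain Java and only models the idea: it assumes whitespace tokenization and lowercasing as a stand-in for the real analyzer, and the class SplitTerms and its methods are hypothetical names. In an actual index, `first` and `remaining` would be written to two separate fields.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical model of the "first term" / "remaining terms" split.
final class SplitTerms {
    final String first;           // would go into the "first term" field
    final List<String> remaining; // would go into the "remaining terms" field

    private SplitTerms(String first, List<String> remaining) {
        this.first = first;
        this.remaining = remaining;
    }

    // Assumption: lowercase + whitespace split approximates the analyzer.
    static SplitTerms of(String companyName) {
        String normalized = companyName.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return new SplitTerms("", Collections.<String>emptyList());
        }
        String[] tokens = normalized.split("\\s+");
        return new SplitTerms(tokens[0],
                Arrays.asList(tokens).subList(1, tokens.length));
    }

    // "startswith" query: consult only the first-term field.
    boolean startsWith(String prefix) {
        return first.startsWith(prefix.toLowerCase(Locale.ROOT));
    }

    // "normal" query: consult both fields.
    boolean anyTermStartsWith(String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        if (first.startsWith(p)) {
            return true;
        }
        for (String term : remaining) {
            if (term.startsWith(p)) {
                return true;
            }
        }
        return false;
    }
}
```

For the example data, SplitTerms.of("My Brand") has "my" as its first term, so a "startswith" check for "bra" fails while the normal check succeeds; for "brahmin farm" both succeed.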

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow