Question

Here's a text with ambiguous words: "A man saw an elephant."

Each word has attributes: lemma, part of speech, and various grammatical attributes depending on its part of speech.

For "saw" it is like:

{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}

All these attributes come from third-party tools; Lucene itself is not involved in word disambiguation.

I want to perform a query like "pos=verb & number=singular" and NOT to get "saw" in the result.
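To make the requirement concrete, here is a minimal Python sketch of the two matching behaviours. It is not Lucene code: the dictionary layout, the function names, and the extra word "sees" (added only so that something legitimately matches) are assumptions for illustration.

```python
# Each token carries a list of alternative parses (hypothetical layout).
TOKENS = {
    "saw": [
        {"lemma": "see", "pos": "verb", "tense": "past"},
        {"lemma": "saw", "pos": "noun", "number": "singular"},
    ],
    # "sees" is not in the original text; it is here only as a true match.
    "sees": [
        {"lemma": "see", "pos": "verb", "tense": "present", "number": "singular"},
    ],
}


def matches_flat(parses, query):
    """Naive check: pool the attributes of all parses together."""
    pooled = {(k, v) for parse in parses for k, v in parse.items()}
    return all(item in pooled for item in query.items())


def matches_per_parse(parses, query):
    """Desired check: a single parse must satisfy every condition."""
    return any(all(parse.get(k) == v for k, v in query.items()) for parse in parses)


query = {"pos": "verb", "number": "singular"}
flat_hits = sorted(w for w, p in TOKENS.items() if matches_flat(p, query))
parse_hits = sorted(w for w, p in TOKENS.items() if matches_per_parse(p, query))
```

With the pooled check, "saw" is returned because "verb" and "singular" both occur somewhere among its parses; the per-parse check returns only "sees".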

I thought of encoding distinct grammatical annotations into strings like "l:see;pos:verb;t:past|l:saw;pos:noun;n:sg" and searching for regexp "pos\:verb[^\|]+n\:sg", but I definitely can't afford regexp queries due to performance issues.
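For reference, the encoding and the regexp can be tried outside Lucene with plain Python. The `encode` helper and the "sees" example are assumptions for illustration; the pattern is the one above without the escapes, which Python's `re` does not need here.

```python
import re


def encode(parses):
    """Join attribute pairs with ';' and alternative parses with '|'."""
    return "|".join(";".join(f"{k}:{v}" for k, v in parse) for parse in parses)


SAW = encode([
    [("l", "see"), ("pos", "verb"), ("t", "past")],
    [("l", "saw"), ("pos", "noun"), ("n", "sg")],
])
# Hypothetical single-parse word, used only to show a positive match.
SEES = encode([[("l", "see"), ("pos", "verb"), ("t", "pres"), ("n", "sg")]])

# [^|]+ cannot cross a '|', so both conditions must hold inside one parse.
PATTERN = re.compile(r"pos:verb[^|]+n:sg")

saw_matches = PATTERN.search(SAW) is not None
sees_matches = PATTERN.search(SEES) is not None
```

Note that the pattern only works if the attributes are always written in a fixed order within a parse (here `pos` before `n`).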

Maybe some hacks with posting list payloads can be applied?

UPD: A draft of my solution

Here are the specifics of my project: there is a fixed maximum number of parses a word can have (say, 8). So I thought of storing the parse number in each attribute's payload and using this payload at the posting list intersection stage. E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular' like ...|...|2.1234|...|... While processing a query like 'pos = Verb AND number = Singular', the 'x.1234' entries would be accepted at every stage of posting list processing until the intersection stage, where they would be rejected because their parse numbers do not match.
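The intersection step I have in mind can be simulated in a few lines of Python. This is only a sketch of the idea, not Lucene's posting list or payload API: the `(position, parse_no)` tuples stand in for "parse.position" entries such as 1.1234, and position 2000 is a made-up entry where both attributes belong to the same parse.

```python
def intersect_same_parse(postings_a, postings_b):
    """Intersect two posting lists of (position, parse_no) entries.

    A position is kept only if both lists contain it with the same
    parse number.
    """
    parses_b = {}
    for pos, parse_no in postings_b:
        parses_b.setdefault(pos, set()).add(parse_no)

    hits = []
    for pos, parse_no in postings_a:
        if parse_no in parses_b.get(pos, ()):
            hits.append((pos, parse_no))
    return hits


# 'pos = Verb' holds for parse 1 at position 1234 ("saw" as a verb).
POS_VERB = [(1234, 1), (2000, 1)]
# 'number = Singular' holds for parse 2 at position 1234 ("saw" as a noun).
NUMBER_SINGULAR = [(1234, 2), (2000, 1)]

result = intersect_same_parse(POS_VERB, NUMBER_SINGULAR)
```

Position 1234 occurs in both lists but with different parse numbers, so it is dropped; only the hypothetical position 2000 survives. With at most 8 parses per word, the parse number needs only 3 bits.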

I think this is a pretty compact solution, but how hard would it be to incorporate it into Lucene?


Solution

So... the cheater way of doing this is (indeed) to control how you build the Lucene index.

When constructing the lucene index, modify each word before Lucene indexes it so that it includes all the necessary attributes of the word. If you index things this way, you must do a lookup in the same way.

One way:

For each type of query you run, build an index whose terms are constructed the same way as that query's terms.

Example:

saw becomes noun-saw; index it as that. saw also becomes verb-past-see; index it as that. saw also becomes noun-singular-saw; index it as that.

The other way:

If you want attribute-based lookup in a single index, you'd probably have to do something like permutation completion on the word 'saw', so that instead of just noun-saw you'd index every possible combination of each parse's attributes, and a query would be one big logic statement over those terms.
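A rough Python sketch of that expansion, under some assumptions: the attribute names, the `key=value` term format, and the fixed `ORDER` list are all made up for illustration, and no Lucene API is involved. Fixing a canonical attribute order means only combinations (not every ordering) need to be indexed, as long as queries are normalized the same way.

```python
from itertools import combinations

# Assumed canonical attribute order, shared by indexing and querying.
ORDER = ["pos", "tense", "number", "lemma"]


def composite_terms(parse):
    """Return one term per non-empty attribute subset of a single parse."""
    present = [k for k in ORDER if k in parse]
    terms = set()
    for size in range(1, len(present) + 1):
        for keys in combinations(present, size):
            terms.add("-".join(f"{k}={parse[k]}" for k in keys))
    return terms


def query_term(conditions):
    """Normalize query conditions into the same canonical term format."""
    return "-".join(f"{k}={conditions[k]}" for k in ORDER if k in conditions)


SAW_PARSES = [
    {"lemma": "see", "pos": "verb", "tense": "past"},
    {"lemma": "saw", "pos": "noun", "number": "singular"},
]

# Terms are generated per parse, so attributes never mix across parses.
index_terms = set().union(*(composite_terms(p) for p in SAW_PARSES))

verb_singular = query_term({"number": "singular", "pos": "verb"}) in index_terms
verb_past = query_term({"pos": "verb", "tense": "past"}) in index_terms
```

Here "saw" is found for verb + past but not for verb + singular. The cost is index size: a parse with n attributes produces 2^n - 1 terms.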

Not sure if this is a good answer, but that's all I could think of.

Licensed under: CC-BY-SA with attribution