Question

My requirement is to generate match score on fuzzy phrase search. Example

1) Input Data - Hello Sam, how are you doing? Thanks, Smith.

  • Indexed Document - Sam Smith (documents are always person/organization names, while the input data is free text)

In the above case, both Sam and Smith are found in my input data, but contextually they are different persons. If my input data were "Hello Sam Smith" then I should get a relevant hit with a higher score (I also expect a decent score for "Hello Sam John Smith" and so on).

I am using Lucene here for primary filtering, and later I will post-process the matched documents against the input data to compute a match score (using Levenshtein); it should also work for fuzzy matches.

Exact approach:

1) Index documents as character tri-grams (a Lucene analyzer sketch follows this list)

2) Search the input free-text data against the tri-gram index

3) Gather all matching documents (this will still contain a lot of noisy matches)

4) Post-process every matched document: find the position of every matched tri-gram token in the input free text and calculate a Levenshtein score between the token(s) at those positions and the entire document.
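For steps 1 and 2, the tri-grams can be produced by a custom analyzer used both at index time (on the name field) and at query time (on the free text). A minimal sketch, assuming Lucene 8.x/9.x (the four-argument NGramTokenFilter constructor); the class name is just a placeholder:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/** Emits lower-cased character tri-grams per word, e.g. "Smith" -> smi, mit, ith. */
public class TriGramAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();          // split on word boundaries
        TokenStream stream = new LowerCaseFilter(source);    // normalize case
        stream = new NGramTokenFilter(stream, 3, 3, false);  // 3-grams only, drop the original token
        return new TokenStreamComponents(source, stream);
    }
}
```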

For example, take the input again - Hello Sam, how are you doing? Thanks, Smith.

Here my matched document will be "Sam Smith". I want to look at each tri-gram of the indexed document and the position of its match in the input free text, like:

1) token "sam" matches the 2nd-position word "Sam" in the input data

2) token "smi" matches the 8th-position word "Smith" in the input data

Later I will write logic to calculate a Levenshtein score between the tokens at positions 2 and 8 and the actual matched document (it would be a very low score considering the distance between positions 2 and 8), but if the token positions were 2, 3 (or 2, 4) I would give a good score.
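To make step 4 concrete, here is a plain-Java sketch (independent of Lucene) of that idea: collect the 1-based word positions in the free text whose tri-grams overlap the candidate document, run Levenshtein between the words at those positions and the whole document, and damp the result by how spread out the positions are. The proximity penalty used here (matched-word count divided by the span of positions) is just one illustrative choice:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ProximityScore {

    /** Standard dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    /** 1-based word positions in the input whose tri-grams overlap the document's tri-grams. */
    static List<Integer> matchedPositions(String[] inputWords, Set<String> docTrigrams) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < inputWords.length; i++) {
            String w = inputWords[i].toLowerCase();
            for (int k = 0; k + 3 <= w.length(); k++) {
                if (docTrigrams.contains(w.substring(k, k + 3))) { positions.add(i + 1); break; }
            }
        }
        return positions;
    }

    /** Score = (1 - normalized Levenshtein of matched words vs the document) * proximity penalty. */
    static double score(String input, String document) {
        String[] inputWords = input.split("\\W+");
        Set<String> docTrigrams = new HashSet<>();
        for (String w : document.toLowerCase().split("\\W+"))
            for (int k = 0; k + 3 <= w.length(); k++) docTrigrams.add(w.substring(k, k + 3));

        List<Integer> pos = matchedPositions(inputWords, docTrigrams);
        if (pos.isEmpty()) return 0.0;

        List<String> windowWords = new ArrayList<>();
        for (int p : pos) windowWords.add(inputWords[p - 1]);
        String window = String.join(" ", windowWords).toLowerCase();

        int dist = levenshtein(window, document.toLowerCase());
        double textScore = 1.0 - (double) dist / Math.max(window.length(), document.length());

        int span = pos.get(pos.size() - 1) - pos.get(0) + 1;   // words spanned by the matches
        double proximityPenalty = (double) pos.size() / span;  // 1.0 when the matches are adjacent
        return textScore * proximityPenalty;
    }

    public static void main(String[] args) {
        System.out.println(score("Hello Sam, how are you doing? Thanks, Smith.", "Sam Smith")); // positions 2 and 8 -> damped
        System.out.println(score("Hello Sam Smith", "Sam Smith"));                              // positions 2 and 3 -> full score
    }
}
```

With this, positions 2 and 3 keep the full Levenshtein score, while positions 2 and 8 are damped by a factor of 2/7.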

I would like to get feedback from experts on this approach, or better suggestions. Thanks.

Solution

I'm doing a similar sort of fuzzy phrase matching in Lucene using tokenized sequences. Token distances are computed with Levenshtein or Jaro-Winkler, and then Smith-Waterman is used to find the best sequence alignment. If I were to adapt that approach to your case, the trouble would be that alignment scoring has no way to (directly) favor token swaps (displaced token substitution). The only thing I could do would be to assign a lower insertion cost to tokens that appear in the source than to those that do not.
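As a rough illustration (not my production code), here is a minimal sketch of that kind of alignment: normalized Levenshtein as the token similarity, a fixed gap penalty of 0.5, and the similarity rescaled to a [-1, 1] substitution score, all of which are illustrative choices:

```java
public class TokenAlignment {

    /** Normalized Levenshtein similarity in [0, 1]. */
    static double tokenSimilarity(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return 1.0 - (double) d[a.length()][b.length()] / Math.max(1, Math.max(a.length(), b.length()));
    }

    /**
     * Smith-Waterman local alignment over token sequences: the substitution score is the
     * fuzzy token similarity rescaled to [-1, 1], and gaps pay a fixed penalty.
     */
    static double align(String[] query, String[] doc) {
        double gap = 0.5, best = 0.0;
        double[][] h = new double[query.length + 1][doc.length + 1];
        for (int i = 1; i <= query.length; i++) {
            for (int j = 1; j <= doc.length; j++) {
                double sub = 2.0 * tokenSimilarity(query[i - 1], doc[j - 1]) - 1.0;
                h[i][j] = Math.max(0.0, Math.max(h[i - 1][j - 1] + sub,
                        Math.max(h[i - 1][j] - gap, h[i][j - 1] - gap)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] doc = {"sam", "smith"};
        System.out.println(align("hello sam smith".split(" "), doc));                          // adjacent matches -> 2.0
        System.out.println(align("hello sam how are you doing thanks smith".split(" "), doc)); // distant matches -> 1.0
    }
}
```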

So I like the n-gram approach for scoring that is less sensitive to non-local reordering. I suggest checking out BLEU, METEOR, and ROUGE, which are standard n-gram metrics for sentence similarity with various approaches to order sensitivity. They can be used either with character-level n-grams, as in your proposal, or with token-level n-grams, as I'm doing.
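To give a flavour of what those metrics measure, here is a hypothetical character tri-gram overlap F1. The real BLEU/METEOR/ROUGE add a lot more (higher-order n-grams, brevity penalties, stemming/synonym matching, longest-common-subsequence scoring), which is where their order sensitivity comes from; this sketch only shows the basic overlap component:

```java
import java.util.HashMap;
import java.util.Map;

public class TriGramOverlap {

    /** Multiset of character tri-grams, taken word by word and lower-cased. */
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            for (int i = 0; i + 3 <= w.length(); i++)
                counts.merge(w.substring(i, i + 3), 1, Integer::sum);
        return counts;
    }

    /** F1 over the tri-gram multisets: precision against the candidate, recall against the reference. */
    static double f1(String candidate, String reference) {
        Map<String, Integer> c = trigrams(candidate), r = trigrams(reference);
        int overlap = 0, cTotal = 0, rTotal = 0;
        for (int n : c.values()) cTotal += n;
        for (int n : r.values()) rTotal += n;
        for (Map.Entry<String, Integer> e : c.entrySet())
            overlap += Math.min(e.getValue(), r.getOrDefault(e.getKey(), 0));
        if (overlap == 0 || cTotal == 0 || rTotal == 0) return 0.0;
        double precision = (double) overlap / cTotal, recall = (double) overlap / rTotal;
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        System.out.println(f1("Hello Sam Smith", "Sam Smith"));                              // ~0.73
        System.out.println(f1("Hello Sam, how are you doing? Thanks, Smith.", "Sam Smith")); // ~0.38, diluted by the extra words
    }
}
```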
