سؤال

I am trying to device an algorithm that performs error correction in names. My approach is having a database with the correct names, compute edit distance between each of them and the name entered and then suggest the 5 or 10 closest.

This task is significantly different from standard error correction in words as some of the names might be replaced by initials. For instance "Jonathan Smith" and "J. Smith" are actually quite close and could easily be considered the same name, so the edit distance should be really small if not 0. Another challenge is that some names might be written differently while sounding the same. For instance Shnaider and Schneider are versions of the same name written by people with different locales(there are better examples for that I guess). And another case - just imagine all the possible errors in writing Jawaharlal Nehru most of which have nothing to do with the real name. Again probably most of them will be similar phonetically.

Obviously Lucene's error correction algorithm will not help me here as it does not handle the above cases.

So my question is: do you know any library capable of doing error correction in names? Can you propose some algorithm for handling the cases mentioned above?

I am interested in libraries in c++ or java. As for algorithm proposals any language or pseudo code will do.

هل كانت مفيدة؟

المحلول

For phonetic matching, see Soundex.

I think modifying a Levenshtein distance algorithm to treat "abbreviate to an initial" and "expand from an initial" as single-distance edits ought to be straightforward, but the details are beyond me at the moment.

نصائح أخرى

You might also look at Metaphone.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top