Question

I have a set of people's names, each tied to an identifying number (e.g. Social Security Number, National ID or Passport Number). Because of duplication, one identity number can have up to 100 names, which may be similar or totally different. E.g. ID 221 could have the names Richard Parker, Mary Parker, Aunt May, Parker Richard, M@rrrrryy Richard, etc. Some are typos, but some are totally different names.

Initially, I want to display only 3 (or a similarly small number) of the names that are as different as possible from the rest, so as to alert the viewer that the multiple names may not be typos but could be a case of identity theft, negligent data capture or something else.

I've read up on an algorithm to detect similarity and am currently looking at this one, which computes a score: a score of 1 means the two strings are the same, while a lower score means they are dissimilar. In my use case, how can I go through, say, the 100 names and display the 3 that are most dissimilar? The algorithm for that escapes me: I feel like I need a starting point, then compare it against all the others, then loop again, and so on.


Solution

Take the function from https://stackoverflow.com/a/14631287/1082673, as you mentioned, and iterate over all pairwise combinations in your list. This works as long as you don't have too many entries; the number of pairs grows quadratically (100 names give 4,950 pairs), so the computation time can increase pretty fast.

Here is how to generate the pairs for a given list:

import itertools

persons = ['person1', 'person2', 'person3']

for p1, p2 in itertools.combinations(persons, 2):
    print("Compare", p1, "and", p2)
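
To get from the pairs to the 3 names to display, one option is to add up each name's similarity to all the others and keep the names with the lowest totals. The following is only a rough sketch: `similarity` here uses `difflib.SequenceMatcher` from the standard library as a stand-in for the scoring function from the linked answer, and `most_dissimilar` is a made-up helper name, so swap in whatever scorer you actually use.

```python
import itertools
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in scorer (assumption): 1.0 means identical, lower means less alike.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def most_dissimilar(names, k=3):
    # Drop exact duplicates but keep the original order.
    names = list(dict.fromkeys(names))
    # Sum each name's similarity to every other name, scoring each pair once.
    totals = defaultdict(float)
    for a, b in itertools.combinations(names, 2):
        score = similarity(a, b)
        totals[a] += score
        totals[b] += score
    # The names with the lowest totals are the least like the rest.
    return sorted(names, key=lambda n: totals[n])[:k]


names = ["Richard Parker", "Parker Richard", "Mary Parker",
         "Aunt May", "M@rrrrryy Richard"]
print(most_dissimilar(names))
```

This ranks each name against the whole group, so it favours outliers; if you would rather have 3 names that are as different as possible from each other, you would need a different selection step on top of the same pairwise scores.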
Licensed under: CC-BY-SA with attribution