Edited in response to your comment:
You'll need difflib, itertools, and pandas (imported as pd) for what follows.
In [164]: df = pd.DataFrame({'col1': ['maria','fred','john'], 'col2': ['mary','orange','maria']})
The next line makes all the combos: (maria, mary), (maria, orange), (maria, maria), (fred ...)
In [165]: combos = itertools.product(df.col1, df.col2)
combos is an iterator (not a list yet) that yields tuples like ('maria', 'mary') ..., 9 in total. Since we need the best match for each name, we need to group the tuples by the name from col1.
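As a quick illustration of what itertools.product yields, here is a minimal sketch that uses plain lists in place of the DataFrame columns (the list names are just stand-ins for df.col1 and df.col2):

```python
import itertools

# Plain lists standing in for df.col1 and df.col2 (an assumption for this sketch)
col1 = ['maria', 'fred', 'john']
col2 = ['mary', 'orange', 'maria']

# product yields every (col1, col2) pair lazily; list() materializes them
combos = list(itertools.product(col1, col2))

print(len(combos))   # 9
print(combos[:3])    # [('maria', 'mary'), ('maria', 'orange'), ('maria', 'maria')]
```

Note that the pairs for each col1 name come out together, which is what makes the grouping step work without sorting.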
In [166]: groups = [list(g) for k, g in itertools.groupby(combos, lambda x: x[0])]
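Now we have a list of three lists: [[('maria', 'mary'), ('maria', 'orange'), ('maria', 'maria')], [...]]. The second argument to groupby is the key function that decides where one group ends and the next begins. groupby only groups consecutive items, which is fine here because product emits all the pairs for one col1 name together. Check out the itertools docs.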
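To see how the key function splits things up, here is a small sketch with made-up data (the pairs list is purely illustrative). It shows that groupby starts a new group whenever the key changes, so unsorted input can produce the same key more than once:

```python
import itertools

# Made-up data: the key 'a' appears in two separate runs
pairs = [('a', 1), ('a', 2), ('b', 3), ('a', 4)]

def first(x):
    """Key function: group by the first element of each tuple."""
    return x[0]

# A new group starts each time the key changes, so ('a', 4) is on its own
groups = [list(g) for k, g in itertools.groupby(pairs, key=first)]
print(groups)
# [[('a', 1), ('a', 2)], [('b', 3)], [('a', 4)]]

# Sorting by the same key first gives one group per distinct key
merged = [list(g) for k, g in itertools.groupby(sorted(pairs, key=first), key=first)]
print(merged)
# [[('a', 1), ('a', 2), ('a', 4)], [('b', 3)]]
```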
In [167]: groups
Out[167]:
[[('maria', 'mary'), ('maria', 'orange'), ('maria', 'maria')],
 [('fred', 'mary'), ('fred', 'orange'), ('fred', 'maria')],
 [('john', 'mary'), ('john', 'orange'), ('john', 'maria')]]
Define a helper function:
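def get_best(group):
    # group is a list of (col1_name, col2_name) tuples that share one col1_name
    k = group[0][0]  # the original name, in case you want to return it too
    # map each candidate name from col2 to its similarity ratio
    ratios = {x[1]: difflib.SequenceMatcher(None, *x).ratio() for x in group}
    # pick the (name, ratio) pair with the highest ratio
    # (.items() works on Python 2 and 3; the original used .iteritems())
    winner = max(ratios.items(), key=lambda x: x[1])
    return winner[0]  # mess with this to return original name, matching name, ratio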
This is the function you'll apply to each of the lists in groups. Just like before, we hand off the pair to SequenceMatcher to get the ratio, only now we need to keep the name around. Inside that function, x is a tuple like ('maria', 'mary'). We need to know both the name in the best match and the ratio of the best match, so I threw them in a dict of {name: ratio}. The other thing here is that max takes a key argument. This time it's just saying the thing to maximize is x[1], the ratio.
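If you want the original name and the ratio as well as the matching name, something like the following should work. This is only a sketch: get_best_with_ratio is a hypothetical name, not part of the answer's code.

```python
import difflib

def get_best_with_ratio(group):
    """Hypothetical variant of get_best: returns (original, match, ratio)."""
    original = group[0][0]
    # {candidate name: similarity ratio against the original name}
    ratios = {x[1]: difflib.SequenceMatcher(None, *x).ratio() for x in group}
    # key=... tells max to compare the (name, ratio) pairs by ratio
    match, ratio = max(ratios.items(), key=lambda x: x[1])
    return original, match, ratio

group = [('maria', 'mary'), ('maria', 'orange'), ('maria', 'maria')]
result = get_best_with_ratio(group)
print(result)  # ('maria', 'maria', 1.0)
```

Without the key argument, max would compare the pairs by name first, which is not what we want.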
And get the best matches:
In [173]: best = [get_best(group) for group in groups]
In [175]: df['best_match'] = best
In [176]: df
Out[176]:
    col1    col2 best_match
0  maria    mary      maria
1   fred  orange     orange
2   john   maria     orange
[3 rows x 3 columns]
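This should be fairly efficient for frames of modest size. It runs one SequenceMatcher comparison per (col1, col2) pair, so the work grows with len(col1) * len(col2).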
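To sanity-check the whole pipeline without pandas, here is a self-contained sketch that runs the same steps on plain lists (col1 and col2 are stand-ins for the DataFrame columns, and get_best is written for Python 3):

```python
import difflib
import itertools

# Plain lists standing in for df.col1 and df.col2 (an assumption for this sketch)
col1 = ['maria', 'fred', 'john']
col2 = ['mary', 'orange', 'maria']

def get_best(group):
    """Return the col2 name with the highest similarity ratio in this group."""
    ratios = {x[1]: difflib.SequenceMatcher(None, *x).ratio() for x in group}
    return max(ratios.items(), key=lambda x: x[1])[0]

# All pairs, grouped by the col1 name, then the best match for each group
combos = itertools.product(col1, col2)
groups = [list(g) for k, g in itertools.groupby(combos, lambda x: x[0])]
best = [get_best(group) for group in groups]

print(best)  # ['maria', 'orange', 'orange']
```

One caveat, as far as I can tell: if col1 contains the same name in adjacent rows, groupby merges those rows into one group, so best ends up shorter than the frame and the df['best_match'] assignment would fail with a length mismatch.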