Vectorized/Matrix calculation between 2 Pandas dataframe columns

Question

Edited in response to your comment:

In [164]: df = pd.DataFrame({'col1': ['maria','fred','john'], 'col2': ['mary','orange','maria']})

This builds all the combinations: (maria, mary), (maria, orange), (maria, maria), (fred, ...), and so on.

In [165]: combos = itertools.product(df.col1, df.col2)

combos is a lazy iterator that yields tuples like ('maria', 'mary'), 9 in total. Since we need the best match for each name, we group the tuples by the name from col1.
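To check what product yields, here is a minimal sketch that uses plain lists in place of the DataFrame columns. The names n_pairs and first_three are only for illustration. Since product returns a lazy iterator, the sketch wraps it in list() so it can be inspected; in the session above the iterator is passed straight to groupby instead.

```python
import itertools

col1 = ['maria', 'fred', 'john']
col2 = ['mary', 'orange', 'maria']

# product() is lazy; list() materializes every (col1, col2) pair.
combos = list(itertools.product(col1, col2))

n_pairs = len(combos)        # one pair per (col1, col2) combination
first_three = combos[:3]     # all pairs for the first col1 name come first
print(n_pairs, first_three)
```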

In [166]: groups = [list(g) for k, g in itertools.groupby(combos, lambda x: x[0])]

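Now we have a list of three lists: [[('maria', 'mary'), ('maria', 'orange'), ('maria', 'maria')], [...]]. The second argument to groupby is the key function that decides where one group ends and the next begins. groupby only groups consecutive items with the same key, which works here because product emits every pair for one col1 name before moving on to the next. Check out the itertools docs.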

In [167]: groups
Out[167]: 
[[('maria', 'mary'), ('maria', 'orange'), ('maria', 'maria')],
 [('fred', 'mary'), ('fred', 'orange'), ('fred', 'maria')],
 [('john', 'mary'), ('john', 'orange'), ('john', 'maria')]]
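One caveat worth sketching: because groupby merges consecutive items with equal keys, two adjacent rows in col1 with the same name would collapse into a single group, and the result would no longer line up with the rows of df. The example below uses a hypothetical col1 with a repeated name to show this, plus a nested list comprehension (my own alternative, not part of the approach above) that always builds one group per row.

```python
import itertools

col1 = ['maria', 'maria', 'john']   # hypothetical column with a repeated name
col2 = ['mary', 'orange']

combos = itertools.product(col1, col2)
# Adjacent pairs with the same first element end up in one group.
merged = [list(g) for _, g in itertools.groupby(combos, key=lambda x: x[0])]

# Building the groups directly gives exactly one group per row of col1.
per_row = [[(a, b) for b in col2] for a in col1]

print(len(merged), len(per_row))
```

If col1 can contain repeated names, the per_row form keeps the groups aligned with the DataFrame index.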

Define a helper function:

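import difflib

def get_best(group):
    # group is a list of (col1_name, col2_name) tuples that share the same col1_name
    k = group[0][0]  # the original name from col1
    ratios = {x[1]: difflib.SequenceMatcher(None, *x).ratio() for x in group}
    # .items() works on Python 2 and 3 (iteritems is Python 2 only)
    winner = max(ratios.items(), key=lambda x: x[1])
    return winner[0]  # best-matching name; return (k, winner[0], winner[1]) for original name, matching name, ratio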

This is the function you'll apply to each of the lists in groups. Just like before, we hand off the pair to SequenceMatcher to get the ratio, only now we need to keep the name around. Inside the dict comprehension, x is a tuple like ('maria', 'mary'). We need to know both the name of the best match and its ratio, so I put them in a dict of the form {name: ratio}. The other thing to note is that max takes a key argument: here it says that the thing to maximize is x[1], the ratio, rather than the (name, ratio) pair itself.
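As the comment in get_best suggests, the return value can be changed to carry more information. Below is a minimal sketch of such a variant, assuming Python 3. The name get_best_with_ratio is hypothetical; it returns the original name, the best-matching name, and the ratio together.

```python
import difflib

def get_best_with_ratio(group):
    """Hypothetical variant of get_best: returns (original, best match, ratio)."""
    original = group[0][0]
    # Map each candidate name to its similarity ratio against the original.
    ratios = {b: difflib.SequenceMatcher(None, a, b).ratio() for a, b in group}
    # key= tells max to compare the (name, ratio) pairs by ratio only.
    name, ratio = max(ratios.items(), key=lambda item: item[1])
    return original, name, ratio

result = get_best_with_ratio(
    [('maria', 'mary'), ('maria', 'orange'), ('maria', 'maria')]
)
print(result)
```

Returning the ratio makes it possible to discard weak matches afterwards, for example by ignoring anything below a threshold of your choosing.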

And get the best matches:

In [173]: best = [get_best(group) for group in groups]

In [175]: df['best_match'] = best

In [176]: df
Out[176]: 
    col1    col2 best_match
0  maria    mary      maria
1   fred  orange     orange
2   john   maria     orange

[3 rows x 3 columns]

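This should be fairly efficient for columns of modest size. It makes one SequenceMatcher comparison per (col1, col2) pair, so the work grows with len(col1) * len(col2).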
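For comparison, the standard library also offers difflib.get_close_matches, which ranks candidates by SequenceMatcher ratio and can replace the product/groupby/get_best steps when only the best-matching name is needed. This is a different technique from the one above, shown as a sketch with plain lists standing in for the DataFrame columns. cutoff=0.0 is my assumption so that every name gets a match even when the similarity is low.

```python
import difflib

col1 = ['maria', 'fred', 'john']
col2 = ['mary', 'orange', 'maria']

# n=1 keeps only the top candidate; cutoff=0.0 disables the default 0.6 threshold.
best = [difflib.get_close_matches(name, col2, n=1, cutoff=0.0)[0] for name in col1]
print(best)
```

The resulting list can be assigned to df['best_match'] in the same way as before. Note that get_close_matches compares each candidate against the name (the reverse of the argument order used in get_best), which can occasionally change the ratio, so treat the two approaches as close but not guaranteed identical.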