Question

I'm using the difflib ratio to calculate the similarity between 2 strings:

ratio = difflib.SequenceMatcher(None, 'string1', 'string2').ratio()

The output is a single float between 0 and 1, which can be interpreted as the match score.

What I'm trying to do is create a column which contains the best match based on max(ratio) between the value and a list of other values.

So if:

df.col1 = 'maria','fred','john'

and:

df2.col1 = 'mary','orange','maria'

df.bestmatch would contain the best match for 'maria', 'fred' and 'john' based on the df2.col1 values.

I feel like this is possible using the .apply method, but I just can't wrap my head around how to calculate each value in df.col1 against df2.col1.

UPDATE: the difflib.get_close_matches method handled large arrays much better and gave me everything I wanted except the ratio score (not a big deal). Tom's answer below worked for smaller datasets, but I got a MemoryError when each column had ~19,000 values.
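For reference, a minimal sketch of what the get_close_matches approach might look like. The helper name best_match and the candidates list are made up for illustration, and cutoff=0.0 is an assumption so that every value gets some match (with the default cutoff of 0.6, weak matches are dropped and the result can be empty).

```python
import difflib

names = ['maria', 'fred', 'john']          # stands in for df.col1
candidates = ['mary', 'orange', 'maria']   # stands in for df2.col1


def best_match(value, choices):
    """Return the closest string in choices, or None if nothing qualifies."""
    # n=1 keeps only the top match; cutoff=0.0 accepts even very weak matches.
    matches = difflib.get_close_matches(value, choices, n=1, cutoff=0.0)
    return matches[0] if matches else None


best = [best_match(name, candidates) for name in names]
print(best)  # ['maria', 'orange', 'orange']
```

With pandas, the same helper could be used as df['bestmatch'] = df.col1.apply(lambda v: best_match(v, candidates)), where candidates is built from df2.col1.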


Solution

Edited in response to your comment:

In [163]: import difflib, itertools; import pandas as pd

In [164]: df = pd.DataFrame({'col1': ['maria','fred','john'], 'col2': ['mary','orange','maria']})

itertools.product makes all the combinations: (maria, mary), (maria, orange), (maria, maria), (fred ...)

In [165]: combos = itertools.product(df.col1, df.col2)

combos is a lazy iterator that yields tuples like ('maria', 'mary') ..., 9 in total. Since we need the best match for each name, we need to group the tuples by the name from col1.

In [166]: groups = [list(g) for k, g in itertools.groupby(combos, lambda x: x[0])]

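Now we have a list of three lists: [[('maria', 'mary'), ('maria', 'orange'), ('maria', 'maria')], [...]]. The second argument to groupby is the key function that decides where one group ends and the next begins. groupby only groups consecutive items, which works here because product emits every pair for one col1 value before moving on to the next. Check out the itertools docs.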

In [167]: groups
Out[167]: 
[[('maria', 'mary'), ('maria', 'orange'), ('maria', 'maria')],
 [('fred', 'mary'), ('fred', 'orange'), ('fred', 'maria')],
 [('john', 'mary'), ('john', 'orange'), ('john', 'maria')]]
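To see why the ordering matters, here is a small sketch of how groupby treats non-adjacent keys. The pairs list is made-up data, not part of the original session.

```python
import itertools

# 'a' appears in two separate runs, so groupby yields it twice.
pairs = [('a', 1), ('a', 2), ('b', 3), ('a', 4)]

runs = [
    (key, [value for _, value in group])
    for key, group in itertools.groupby(pairs, key=lambda p: p[0])
]
print(runs)  # [('a', [1, 2]), ('b', [3]), ('a', [4])]
```

So if the pairs were ever shuffled or built in a different order, they would need to be sorted by name before grouping.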

Define a helper function:

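def get_best(group):
    # group is a list of (name, candidate) tuples that all share the same name
    ratios = {x[1]: difflib.SequenceMatcher(None, *x).ratio() for x in group}
    # pick the (candidate, ratio) pair with the highest ratio
    winner = max(ratios.items(), key=lambda x: x[1])
    # winner[0] is the matching name and winner[1] its ratio; group[0][0] is the original name
    return winner[0]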

This is the function you'll apply to each of the lists in groups. Just like before, we hand off the pair to SequenceMatcher to get the ratio, only now we need to keep the name around. In that function x is a tuple like ('maria', 'mary'). We need to know the name of the best match and the ratio of the best match, so I put them in a dict of {name: ratio}. The other thing here is that max takes a key argument. Here it says that the thing to maximize is x[1], the ratio.
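As a quick illustration of the max-with-key idea on its own, here is a sketch for a single name. The variable names name and score are only for this example.

```python
import difflib

candidates = ['mary', 'orange', 'maria']

# Map each candidate to its similarity ratio against 'maria'.
ratios = {c: difflib.SequenceMatcher(None, 'maria', c).ratio() for c in candidates}

# items() yields (candidate, ratio) pairs; the key tells max to compare by ratio.
name, score = max(ratios.items(), key=lambda item: item[1])
print(name, score)  # maria 1.0
```

Without the key, max would compare the tuples by candidate name first, which is not what we want.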

And get the best matches:

In [173]: best = [get_best(group) for group in groups]

In [175]: df['best_match'] = best

In [176]: df
Out[176]: 
    col1    col2 best_match
0  maria    mary      maria
1   fred  orange     orange
2   john   maria     orange

[3 rows x 3 columns]

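This should be fine for small inputs. Note that groups holds every (col1, col2) pair in memory at once, so memory use grows with len(col1) * len(col2); the question's update reports a MemoryError at around 19,000 values per column.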
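If memory is the concern, one possible variation is to scan the candidates one name at a time instead of building all the pairs up front, which also makes it easy to keep the ratio. This is only a sketch; best_match_with_ratio is a hypothetical helper, not part of the answer above, and it still does the same number of comparisons.

```python
import difflib


def best_match_with_ratio(value, choices):
    """Return (best_choice, ratio) for value, holding one comparison at a time."""
    best, best_ratio = None, -1.0
    for choice in choices:
        ratio = difflib.SequenceMatcher(None, value, choice).ratio()
        if ratio > best_ratio:  # on ties, the first candidate seen is kept
            best, best_ratio = choice, ratio
    return best, best_ratio


candidates = ['mary', 'orange', 'maria']
results = [best_match_with_ratio(name, candidates) for name in ['maria', 'fred', 'john']]
print([match for match, _ in results])  # ['maria', 'orange', 'orange']
```

With pandas, the two parts of each result could be assigned to separate columns, for example by building the list from df.col1 and unpacking it with zip(*results).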

Licensed under: CC-BY-SA with attribution