Question

I am using a Dice Coefficient based function to calculate the similarity of two strings:

def dice_coefficient(a,b):
    try:
        if not len(a) or not len(b): return 0.0
    except TypeError:  # a or b has no len() (e.g. None)
        return 0.0
    if a == b: return 1.0
    if len(a) == 1 or len(b) == 1: return 0.0
    a_bigram_list = [a[i:i+2] for i in range(len(a)-1)]
    b_bigram_list = [b[i:i+2] for i in range(len(b)-1)]
    a_bigram_list.sort()
    b_bigram_list.sort()
    lena = len(a_bigram_list)
    lenb = len(b_bigram_list)
    matches = i = j = 0
    while (i < lena and j < lenb):
        if a_bigram_list[i] == b_bigram_list[j]:
            matches += 2
            i += 1
            j += 1
        elif a_bigram_list[i] < b_bigram_list[j]:
            i += 1
        else:
            j += 1
    score = float(matches)/float(lena + lenb)
    return score

However, I am trying to evaluate the best match out of a large list of candidates, and I want to use a list comprehension / map / vectorized function calls over a whole series of strings to make this computationally efficient. I am having difficulty getting the run time into a reasonable ballpark for even medium-sized series (10K-100K elements).

I want to send two input series through the function and get the best possible match for every entry in dflist1 against a second series, dflist2. Ideally, but not necessarily, the return would be another series in the dflist1 dataframe holding the best possible score as well. I have a working implementation (below), but it's incredibly slow. Is it also possible to parallelize this? I think this would be a hugely valuable problem to solve, as it would perform the same function that reconcile-csv currently does.

import time
import pandas as pd

dflist1 = pd.read_csv('\\list1.csv', header=0, encoding="ISO-8859-1")
dflist2 = pd.read_csv('\\list2.csv', header=0, error_bad_lines=False)
dflist1['Best Match'] = 'NA'
dflist1['Best Score'] = 0.0
start = time.time()
for index, row in dflist1.iterrows():
    # score the current master entry against every target candidate
    d = [dice_coefficient(row['MasterList'], target)
         for target in dflist2['TargetList']]
    best = d.index(max(d))
    # .at avoids the chained-assignment pitfall of dflist1[col][index] = ...
    dflist1.at[index, 'Best Match'] = dflist2['TargetList'][best]
    dflist1.at[index, 'Best Score'] = d[best]
    print('Finished ' + str(index) + ' out of ' + str(len(dflist1.index))
          + ' matches after ' + str(round(time.time() - start)) + ' seconds.')

Any help would be appreciated very much!

Solution

Your function does a lot of pure-Python data crunching. In cases like this, numba (http://numba.pydata.org) can be useful.

In the code below I split your function into two parts: sorting and scoring. I then converted your bigrams from strings to integers (to comply with numba's supported datatypes) and decorated the scoring subfunction with numba's @njit (the nopython JIT compiler).

from numba import njit
import numpy as np

def dice_coefficient(a, b):
    try:
        if not len(a) or not len(b): return 0.0
    except TypeError:
        return 0.0
    if a == b: return 1.0
    if len(a) == 1 or len(b) == 1: return 0.0
    a_bigram_list = [a[i:i+2] for i in range(len(a)-1)]
    b_bigram_list = [b[i:i+2] for i in range(len(b)-1)]

    a_bigram_list.sort()
    b_bigram_list.sort()

    lena = len(a_bigram_list)
    lenb = len(b_bigram_list)
    matches = i = j = 0
    while i < lena and j < lenb:
        if a_bigram_list[i] == b_bigram_list[j]:
            matches += 2
            i += 1
            j += 1
        elif a_bigram_list[i] < b_bigram_list[j]:
            i += 1
        else:
            j += 1
    score = float(matches) / float(lena + lenb)
    return score


def dice_coefficient_new(a, b):
    try:
        if not len(a) or not len(b): return 0.0
    except TypeError:
        return 0.0
    if a == b: return 1.0
    if len(a) == 1 or len(b) == 1: return 0.0

    a_bigram_list, b_bigram_list = dice_coefficient_sorting(a, b)
    score = dice_coefficient_scoring(a_bigram_list, b_bigram_list)

    return score


def dice_coefficient_sorting(a, b):
    # encode each character as its code point and pack each bigram
    # into a single integer (256*first + second); assumes code points < 256
    a = np.array([ord(i) for i in a])
    b = np.array([ord(i) for i in b])

    a_bigram_list = 256*a[:-1] + a[1:]
    b_bigram_list = 256*b[:-1] + b[1:]

    a_bigram_list.sort()
    b_bigram_list.sort()

    return a_bigram_list, b_bigram_list


@njit
def dice_coefficient_scoring(a_bigram_list, b_bigram_list):
    # merge-style walk over the two sorted bigram arrays
    lena = len(a_bigram_list)
    lenb = len(b_bigram_list)
    matches = i = j = 0
    while i < lena and j < lenb:
        if a_bigram_list[i] == b_bigram_list[j]:
            matches += 2
            i += 1
            j += 1
        elif a_bigram_list[i] < b_bigram_list[j]:
            i += 1
        else:
            j += 1
    score = float(matches) / float(lena + lenb)

    return score

Let's time it:

N = np.power(10,5)

a = ''.join(chr(i) for i in np.random.randint(97, 123, N))
b = ''.join(chr(i) for i in np.random.randint(97, 123, N))

%timeit dice_coefficient(a,b)
%timeit dice_coefficient_new(a,b)

Output:

1 loop, best of 3: 204 ms per loop
10 loops, best of 3: 52.9 ms per loop

So for 100K elements you get a speedup of 4x!
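To plug a scorer like this back into the original best-match search, one possible shape is the following (a self-contained sketch using a plain-Python Dice implementation so it runs without numba; `dice` and `best_match` are illustrative helpers, not part of the answer above):

```python
def dice(a, b):
    # Sørensen-Dice similarity over character bigrams
    if len(a) < 2 or len(b) < 2:
        return 1.0 if a == b else 0.0
    a_bigrams = [a[i:i+2] for i in range(len(a) - 1)]
    b_bigrams = [b[i:i+2] for i in range(len(b) - 1)]
    remaining = list(b_bigrams)
    overlap = 0
    for bg in a_bigrams:
        if bg in remaining:       # count each target bigram at most once
            remaining.remove(bg)
            overlap += 1
    return 2.0 * overlap / (len(a_bigrams) + len(b_bigrams))

def best_match(query, candidates):
    # return the (candidate, score) pair with the highest Dice score
    return max(((c, dice(query, c)) for c in candidates),
               key=lambda pair: pair[1])
```

For example, best_match('nigt', ['night', 'nacht', 'day']) picks 'night', since it shares the bigrams 'ni' and 'ig' with the query.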

For further optimisation you could parallelise your global for loop (for example also using numba or multiprocessing).

(Note: I edited my rushed first answer which didn't work)

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange