Question

I'm trying to write a spellchecker module.

It loads a text, builds a dictionary from a 16 MB file, and then checks whether each encountered word is similar to a word in the dictionary (similar = differs by up to two characters); if so, it replaces the word with the form from the dictionary.

Right now I'm using the Levenshtein distance algorithm, and processing a set of 50 words takes 3 minutes...

I'm pretty sure there must be a faster solution. The profiler told me that my app spends more than 80% of its time in the Levenshtein distance function.

Are there any better solutions/algorithms?

Here is the implemented version of the algorithm I use:

def levenshteinDistance(s1, s2):
    l_s1 = len(s1)
    l_s2 = len(s2)
    # Initialize the DP table: d[i][0] = i and d[0][j] = j.
    d = [[i if j == 0 else (j if i == 0 else 0) for j in xrange(l_s2 + 1)]
         for i in xrange(l_s1 + 1)]
    for i in xrange(1, l_s1 + 1):
        for j in xrange(1, l_s2 + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[l_s1][l_s2]

Solution

I have used Norvig's spell corrector, mentioned in the comments, and it is awesome.

However, coming to your problem: you have written a dynamic-programming edit distance algorithm, and it qualifies as a data-parallel algorithm. On shared memory, i.e. on a single machine with multiple cores, you can exploit them. Do you know of map-reduce? Please don't think distributed right now; just consider a single quad-core machine with shared memory. As step 1, you can partition your dictionary and allocate a portion to each worker, which runs edit distance on its portion of the dictionary (similar to the map step). Afterwards, all the workers return the words at an edit distance of at most 2 (similar to the reduce step). This way your program will benefit from a multi-core architecture; see the sketch below.
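A minimal sketch of that partitioning, assuming the levenshteinDistance function above and a word list dictionary_words (the function names and the workers parameter here are illustrative, not from the question). Note that in CPython, threads won't speed up CPU-bound code because of the GIL, so this sketch uses a process pool instead:

from multiprocessing import Pool

def find_close_words(args):
    # "Map" step: scan one partition of the dictionary for close matches.
    word, dictionary_chunk = args
    return [w for w in dictionary_chunk if levenshteinDistance(word, w) <= 2]

def parallel_candidates(word, dictionary_words, workers=4):
    # Split the dictionary into one partition per worker.
    chunk = len(dictionary_words) // workers + 1
    parts = [dictionary_words[i:i + chunk]
             for i in range(0, len(dictionary_words), chunk)]
    pool = Pool(workers)  # on Windows, call this under "if __name__ == '__main__':"
    try:
        results = pool.map(find_close_words, [(word, p) for p in parts])
    finally:
        pool.close()
        pool.join()
    # "Reduce" step: merge the per-partition candidate lists.
    return [w for part in results for w in part]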

Another thing I can think of: since the edit distance computation is CPU-intensive, write it in C and call it from your Python code, i.e. write a Python extension.
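If you'd rather not write the extension yourself, one pre-built option (my suggestion, not something from the question) is the python-Levenshtein package, which implements the same computation in C:

import Levenshtein  # pip install python-Levenshtein

# Same result as the pure-Python function above, computed in C.
assert Levenshtein.distance('kitten', 'sitting') == 3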

Other tips

Maybe the problem is at a higher level. When a profiler tells you that a lot of time is spent in a function, it might be that you're calling it too often. Are you perhaps comparing each word in the text against every word in the dictionary? Try it the other way around: for each word in the text, directly generate the words at distance <= 2 and check whether they're in the dictionary (see the sketch below).
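A minimal sketch of that inverted lookup, along the lines of Norvig's corrector; it assumes the dictionary has been loaded into a Python set, and the names edits1, known_within_two, and dictionary are chosen here for illustration:

import string

def edits1(word):
    # All strings at edit distance <= 1: deletions, substitutions, insertions.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + replaces + inserts)

def known_within_two(word, dictionary):
    # Dictionary words at edit distance <= 2: generate candidates
    # and test set membership, instead of scanning the whole dictionary.
    one = edits1(word)
    two = {w2 for w1 in one for w2 in edits1(w1)}
    return (one | two) & dictionary

For a word of length n, edits1 generates on the order of 53n + 26 strings, so even the distance-2 set is only a few million membership tests against a hash set, which is far cheaper than running the DP against every dictionary entry.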
