Question

I'm using levenshtein algorithm to meet these requirements:

When finding a word of N characters, the words to suggest as correction in my dictionary database are:

Every dictionary word of N characters that has 1 character of difference with the found word. Example: found word:bearn, dictionary word: bears

Every dictionary word of N+1 characters that has N characters equal to the found word. Example: found word:bear, dictionary word: bears

Every dictionary word of N-1 characters that has N-1 characters equal to the found word. Example: found word: bears, dictionary word: bear

I'm using this implementation of Levenshtein algorithm in C++ to find when a word has a Levenshtein number of 1 (which is the Levenshtein number for all three cases), but then how do I choose the word to suggest? I read about Boyer-Moore-Horspool and Knuth-Morris-Pratt but I'm not sure on how either of them can be helpful.

#include <string>
#include <vector>
#include <algorithm>

using namespace std;

int levenshtein(const string &s1, const string &s2)
{
   string::size_type N1 = s1.length();
   string::size_type N2 = s2.length();
   string::size_type i, j;
   vector<int> T(N2+1);

   for ( i = 0; i <= N2; i++ )
      T[i] = i;

   for ( i = 0; i < N1; i++ ) {
      T[0] = i+1;
      int corner = i;
      for ( j = 0; j < N2; j++ ) {
         int upper = T[j+1];
         if ( s1[i] == s2[j] )
            T[j+1] = corner;
         else
            T[j+1] = min(T[j], min(upper, corner)) + 1;
         corner = upper;
      }
   }
   return T[N2];
}
Was it helpful?

Solution

You may also want to add Norvig's excellent article on spelling correction to your reading.

It's been a while since I've read it but I remember it being very similar to what your writing about.

OTHER TIPS

As I've said elsewhere, Boyer-Moore isn't really apt for this. Since you want to search for multiple stings simultanously, the algorithm of Wu and Manber should be more to your liking.

I've posted a proof of concept C++ code in answer to another question. Heed the caveats mentioned there.

Why restrict the suggestion to a single word, why not include a set of words? If you are restricted to a single word, you can order your results by some pre-calculated frequency of usage or something. This frequency could be updated based on what users select from the suggestion.

Also, in the case where there isn't a spelling error in the original word, you might want to prioritize the N+1 cases, which would be more like an autocomplete. Anyway I don't think there is one correct way to do it, maybe if your requirements are more specific, it would be easier to narrow down.

Also, you don't need to know Python to understand the algorithms described in Norvig's article.

If I understand you correctly, then there is no correct answer to your question. You are identifying up to three suggestions for a given word using Levenshtein - it is up to you to come up with a rule to decide which one to use and which ones to filter out. Or perhaps you should use them all?

Just as a matter of interest, the Damerau extension to Levenshtein might be of interest to you, where two swapped characters are also considered to give a score of 1, instead of 2, which is what vanilla Levenshtein returns.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top