Levenshtein DFA in .NET

https://stackoverflow.com/questions/3977182

09-10-2019
|

Question

Good afternoon,

Does anyone know of an "out-of-the-box" implementation of Levenshtein DFA (deterministic finite automata) in .NET (or easily translatable to it)? I have a very big dictionary with more than 160000 different words, and I want to, given an inicial word w, find all known words at Levenshtein distance at most 2 of w in an efficient way.

Of course, having a function which computes all possible edits at edit distance one of a given word and applying it again to each of these edits solves the problem (and in a pretty straightforwad way). The problem is effiency --- given a 7 letter word, this can already take over 1 second to complete, and I need something much more efficient --- if possible, as it is with Levenshtein DFAs, a solution that takes O(|w|) steps.

Edit: I know I can construct my own approach to the problem with a little bit of studying, but at the moment I can't afford reading Schulz and Mihov's 60-page-long articles.

Thank you very much.

Solution

We implemented this for apache lucene java, perhaps you could convert it to C# and save yourself time.

the main class is here: its just a builder to get Levenshtein DFAs from a string, using the Schulz and Mihov algorithm.

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/LevenshteinAutomata.java

the parametric descriptions (the precomputed tables) for Lev1 and Lev2 are here: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev1ParametricDescription.java

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java

you might notice these are generated with a computer, we generated them with this script, using Jean-Phillipe Barrette's great moman implementation (python) http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/createLevAutomata.py

we generate the parametric descriptions as packed long[] arrays so that it won't make our jar file too large.

just modify the toAutomaton(int n) to fit your needs/DFA package. in our case we are using a modified form of the brics automaton package, where transitions are represented as unicode codepoint ranges.

efficient unit tests are difficult for this sort of thing, but here is what we came up with... it seems to be thorough and even found a bug (which was fixed immediately by the author!) in the moman implementation.

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/TestLevenshteinAutomata.java

OTHER TIPS

Here you go.

/// <summary>
/// Levenshtein Distance Calculator
/// </summary>
public static int DistanceFrom(this string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    // Step 1
    if (n == 0)
        return m;

    if (m == 0)
        return n;

    // Step 2
    for(int i = 0; i <= n; d[i, 0] = i++) ;
    for(int j = 0; j <= m; d[0, j] = j++) ;

    // Step 3
    for (int i = 1; i <= n; i++)
    {
        //Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            // Step 6
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    // Step 7
    return d[n, m];
}

I ported the relevant Lucene Java code as suggested by Robert Muir to C#. As far as the question goes and "out of the box": it is a work in progress but the code appears¹ to work and can probably be optimized² further, although it performs very well indeed.

You can find it here: https://github.com/mjvh80/LevenshteinDFA/ .

UPDATE: It appears that Lucene.NET is not in fact dead (yet?) and I noticed they now have a ported version of this code too. I would thus recommend looking there (https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Core/Util/Automaton/LevenshteinAutomata.cs) for an implementation of this.

¹ the code needs more tests
² because it's java ported to C# perhaps and because I wrote naive replacements of some classes (e.g. bitset).

I understand you want to find near matches in a big dictionary. Here's the way I do it. link.

From what I'm able to figure out about DFA, I can't see how it's any better, or even actually any different, under the skin. NFAs might be faster, but that's because they don't exist. Maybe I'm wrong.

Nick Johnson has a very detailed blog post about the construction of a Levenshtein automaton in Python, and the code is here. It is a good read, and I have used a slightly modified version of the code that I found efficient.

The answer of Mike Dunlavey is good too. I wonder what is the most efficient in this case, a trie search or a Levenshtein DFA?

I'd just like to point out that as of now, the Levenshtein Automaton implementations in both Lucene and Lucene.Net make use of files containing parametric state tables (tables of abstract states which describe the concrete states in an automaton) created using Moman.

If you want a solution capable of constructing such tables from scratch in memory, you might want to have a look at LevenshteinAutomaton. It's in Java, but it is well-structured, easy to follow, and extensively commented, and as such should be easier to port to C# than the current Lucene implementation. It is also maintained by moi.

* Fun fact: I submitted LevenshteinAutomaton as a replacement, or as a reference for a replacement, to the current Levenshthein Automaton implementation in Lucene... 3 years ago.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow