質問

How can i disable counting of deletion, in this implementation of Damerau-Levenshtein distance algorithm, or if there is other algorithm already implemented please point me to it.

Example(disabled deletion counting):

string1: how are you?

string2: how oyu?

distance: 1 (for transposition, 4 deletes doesn't count)

And here is the algorithm:

    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;

        // Return trivial case - where one is empty
        if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
            return (string1 ?? "").Length + (string2 ?? "").Length;


        // Ensure string2 (inner cycle) is longer_transpositionRow
        if (string1.Length > string2.Length)
        {
            var tmp = string1;
            string1 = string2;
            string2 = tmp;
        }

        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            return string2.Length - string1.Length;

        var length1 = string1.Length;
        var length2 = string2.Length;

        var d = new int[length1 + 1, length2 + 1];

        for (var i = 0; i <= d.GetUpperBound(0); i++)
            d[i, 0] = i;

        for (var i = 0; i <= d.GetUpperBound(1); i++)
            d[0, i] = i;

        for (var i = 1; i <= d.GetUpperBound(0); i++)
        {
            var im1 = i - 1;
            var im2 = i - 2;
            var minDistance = threshold;
            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var jm1 = j - 1;
                var jm2 = j - 2;
                var cost = string1[im1] == string2[jm1] ? 0 : 1;

                var del = d[im1, j] + 1;
                var ins = d[i, jm1] + 1;
                var sub = d[im1, jm1] + cost;

                //Math.Min is slower than native code
                //d[i, j] = Math.Min(del, Math.Min(ins, sub));
                d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;

                if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
                    d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);

                if (d[i, j] < minDistance)
                    minDistance = d[i, j];
            }

            if (minDistance > threshold)
                return int.MaxValue;
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold
            ? int.MaxValue
            : d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }
役に立ちましたか?

解決

 public static int DamerauLevenshteinDistance( string string1
                                            , string string2
                                            , int threshold)
{
    // Return trivial case - where they are equal
    if (string1.Equals(string2))
        return 0;

    // Return trivial case - where one is empty
    // WRONG FOR YOUR NEEDS: 
    // if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
    //      return (string1 ?? "").Length + (string2 ?? "").Length;

    //DO IT THIS WAY:
    if (String.IsNullOrEmpty(string1))
        // First string is empty, so every character of 
        // String2 has been inserted:
        return (string2 ?? "").Length;
    if (String.IsNullOrEmpty(string2))
        // Second string is empty, so every character of string1 
        // has been deleted, but you dont count deletions:
        return 0;

    // DO NOT SWAP THE STRINGS IF YOU WANT TO DEAL WITH INSERTIONS
    // IN A DIFFERENT MANNER THEN WITH DELETIONS:
    // THE FOLLOWING IS WRONG FOR YOUR NEEDS:
    // // Ensure string2 (inner cycle) is longer_transpositionRow
    // if (string1.Length > string2.Length)
    // {
    //     var tmp = string1;
    //     string1 = string2;
    //     string2 = tmp;
    // }

    // Return trivial case - where string1 is contained within string2
    if (string2.Contains(string1))
        //all changes are insertions
        return string2.Length - string1.Length;

    // REVERSE CASE: STRING2 IS CONTAINED WITHIN STRING1
    if (string1.Contains(string2))
        //all changes are deletions which you don't count:
        return 0;

    var length1 = string1.Length;
    var length2 = string2.Length;


    // PAY ATTENTION TO THIS CHANGE!
    // length1+1 rows is way too much! You need only 3 rows (0, 1 and 2)
    // read my explanation below the code!
    // TOO MUCH ROWS: var d = new int[length1 + 1, length2 + 1];
    var d = new int[2, length2 + 1];

    // THIS INITIALIZATION COUNTS DELETIONS. YOU DONT WANT IT
    // or (var i = 0; i <= d.GetUpperBound(0); i++)
    //    d[i, 0] = i;

    // But you must initiate the first element of each row with 0:
    for (var i = 0; i <= 2; i++)
        d[i, 0] = 0;


    // This initialization counts insertions. You need it, but for
    // better consistency of code I call the variable j (not i):
    for (var j = 0; j <= d.GetUpperBound(1); j++)
        d[0, j] = j;


    // Now do the job:
    // for (var i = 1; i <= d.GetUpperBound(0); i++)
    for (var i = 1; i <= length1; i++)
    {
        //Here in this for-loop: add "%3" to evey term 
        // that is used as first index of d!

        var im1 = i - 1;
        var im2 = i - 2;
        var minDistance = threshold;
        for (var j = 1; j <= d.GetUpperBound(1); j++)
        {
            var jm1 = j - 1;
            var jm2 = j - 2;
            var cost = string1[im1] == string2[jm1] ? 0 : 1;

            // DON'T COUNT DELETIONS!  var del = d[im1, j] + 1;
            var ins = d[i % 3, jm1] + 1;
            var sub = d[im1 % 3, jm1] + cost;

            // Math.Min is slower than native code
            // d[i, j] = Math.Min(del, Math.Min(ins, sub));
            // DEL DOES NOT EXIST  
            // d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;
            d[i % 3, j] = ins <= sub ? ins : sub;

            if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
                d[i % 3, j] = Math.Min(d[i % 3, j], d[im2 % 3, jm2] + cost);

            if (d[i % 3, j] < minDistance)
                minDistance = d[i % 3, j];
        }

        if (minDistance > threshold)
            return int.MaxValue;
    }

    return d[length1 % 3, d.GetUpperBound(1)] > threshold
        ? int.MaxValue
        : d[length1 % 3, d.GetUpperBound(1)];
}

here comes my explanation why you need only 3 rows:

Look at this line:

var d = new int[length1 + 1, length2 + 1];

If one string has the length n and the other has the length m, then your code needs a space of (n+1)*(m+1) integers. Each Integer needs 4 Byte. This is waste of memory if your strings are long. If both strings are 35.000 byte long, you will need more than 4 GB of memory!

In this code you calculate and write a new value for d[i,j]. And to do this, you read values from its upper neighbor (d[i,jm1]), from its left neighbor (d[im1,j]), from its upper-left neighbor (d[im1,jm1]) and finally from its double-upper-double-left neighbour (d[im2,jm2]). So you just need values from your actual row and 2 rows before.

You never need values from any other row. So why do you want to store them? Three rows are enough, and my changes make shure, that you can work with this 3 rows without reading any wrong value at any time.

他のヒント

I would advise not rewriting this specific algorithm to handle specific cases of "free" edits. Many of them radically simplify the concept of the problem to the point where the metric will not convey any useful information.

For example, when substitution is free the distance between all strings is the difference between their lengths. Simply transmute the smaller string into the prefix of the larger string and add the needed letters. (You can guarantee that there is no smaller distance because one insertion is required for each character of edit distance.)

When transposition is free the question reduces to determining the sum of differences of letter counts. (Since the distance between all anagrams is 0, sorting the letters in each string and exchanging out or removing the non-common elements of the larger string is the best strategy. The mathematical argument is similar to that of the previous example.)

In the case when insertion and deletion are free the edit distance between any two strings is zero. If only insertion OR deletion is free this breaks the symmetry of the distance metric - with free deletions, the distance from a to aa is 1, while the distance from aa to a is 1. Depending on the application this could possibly be desirable; but I'm not sure if it's something you're interested in. You will need to greatly alter the presented algorithm because it makes the mentioned assumption of one string always being longer than the other.

Try to change var del = d[im1, j] + 1; to var del = d[im1, j];, I think that solves your problem.

ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top