Damerau - Levenshtein Distance, aggiungendo una soglia

https://stackoverflow.com/questions/3841507

27-09-2019
|

Domanda

Ho la seguente implementazione, ma voglio aggiungere una soglia, quindi se il risultato sarà maggiore di ciò, smetti di calcolare e tornare.

Come ci farei?

EDIT: ecco il mio codice attuale, threshold non è ancora usato ... l'obiettivo è che viene utilizzato

    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;

        // Return trivial case - where one is empty
        if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
            return (string1 ?? "").Length + (string2 ?? "").Length;


        // Ensure string2 (inner cycle) is longer
        if (string1.Length > string2.Length)
        {
            var tmp = string1;
            string1 = string2;
            string2 = tmp;
        }

        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            return string2.Length - string1.Length;

        var length1 = string1.Length;
        var length2 = string2.Length;

        var d = new int[length1 + 1, length2 + 1];

        for (var i = 0; i <= d.GetUpperBound(0); i++)
            d[i, 0] = i;

        for (var i = 0; i <= d.GetUpperBound(1); i++)
            d[0, i] = i;

        for (var i = 1; i <= d.GetUpperBound(0); i++)
        {
            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var cost = string1[i - 1] == string2[j - 1] ? 0 : 1;

                var del = d[i - 1, j] + 1;
                var ins = d[i, j - 1] + 1;
                var sub = d[i - 1, j - 1] + cost;

                d[i, j] = Math.Min(del, Math.Min(ins, sub));

                if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
            }
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }
}

Soluzione 4

Finalmente l'ho capito ... anche se non è benefico come avevo sperato

    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;

        // Return trivial case - where one is empty
        if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
            return (string1 ?? "").Length + (string2 ?? "").Length;


        // Ensure string2 (inner cycle) is longer
        if (string1.Length > string2.Length)
        {
            var tmp = string1;
            string1 = string2;
            string2 = tmp;
        }

        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            return string2.Length - string1.Length;

        var length1 = string1.Length;
        var length2 = string2.Length;

        var d = new int[length1 + 1, length2 + 1];

        for (var i = 0; i <= d.GetUpperBound(0); i++)
            d[i, 0] = i;

        for (var i = 0; i <= d.GetUpperBound(1); i++)
            d[0, i] = i;

        for (var i = 1; i <= d.GetUpperBound(0); i++)
        {
            var im1 = i - 1;
            var im2 = i - 2;
            var minDistance = threshold;

            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var jm1 = j - 1;
                var jm2 = j - 2;
                var cost = string1[im1] == string2[jm1] ? 0 : 1;

                var del = d[im1, j] + 1;
                var ins = d[i, jm1] + 1;
                var sub = d[im1, jm1] + cost;

                //Math.Min is slower than native code
                //d[i, j] = Math.Min(del, Math.Min(ins, sub));
                d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;

                if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
                    d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);

                if (d[i, j] < minDistance)
                    minDistance = d[i, j];
            }

            if (minDistance > threshold)
                return int.MaxValue;
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold 
            ? int.MaxValue 
            : d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }

Altri suggerimenti

Questo riguarda la tua risposta a questo: Damerau - Levenshtein Distance, aggiungendo una soglia(scusa non posso commentare perché non ho ancora 50 ripetizioni)

Penso che tu abbia fatto un errore qui. Hai inizializzato:

var minDistance = threshold;

E la tua regola di aggiornamento è:

if (d[i, j] < minDistance)
   minDistance = d[i, j];

Inoltre, i tuoi criteri di uscita precoce sono:

if (minDistance > threshold)
   return int.MaxValue;

Ora, osserva che la condizione if sopra non sarà mai vera! Dovresti piuttosto inizializzare minDistance a int.MaxValue

Ecco il modo più elegante a cui riesco a pensare. Dopo aver impostato ogni indice di D, vedere se supera la soglia. La valutazione è a tempo costante, quindi è un calo del secchio rispetto alla complessità teorica N^2 dell'algoritmo complessivo:

public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
    ...

    for (var i = 1; i <= d.GetUpperBound(0); i++)
    {
        for (var j = 1; j <= d.GetUpperBound(1); j++)
        {
            ...

            var temp = d[i,j] = Math.Min(del, Math.Min(ins, sub));

            if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
                temp = d[i,j] = Math.Min(temp, d[i - 2, j - 2] + cost);

            //Does this value exceed your threshold? if so, get out now
            if(temp > threshold) 
              return temp;
        }
    }

    return d[d.GetUpperBound(0), d.GetUpperBound(1)];
}

L'hai anche chiesto come una domanda SQL CLR UDF, quindi risponderò in quel contesto specifico: la migliore optmizzazione non verrà dall'ottimizzazione della distanza di Levenshtein, ma dalla riduzione del numero di coppie che si confronta. Sì, un algoritmo Levenshtein più veloce migliorerà le cose, ma non tanto quanto ridurre il numero di confronti da N Square (con N in milioni di file) a N*qualche fattore. La mia proposta è di confrontare solo elementi che hanno la differenza di lunghezza all'interno di un delta tollerabile. Sul tuo grande tavolo, aggiungi una colonna calcolata persistita LEN(Data) e quindi crea un indice su di esso con includere i dati:

ALTER TABLE Table ADD LenData AS LEN(Data) PERSISTED;
CREATE INDEX ndxTableLenData on Table(LenData) INCLUDE (Data);

Ora puoi limitare lo spazio del problema puro unendoti entro una differenza massima su Lenght (ad es. Dy 5), Se i tuoi dati LEN(Data) varia in modo significativo:

SELECT a.Data, b.Data, dbo.Levenshtein(a.Data, b.Data)
FROM Table A
JOIN Table B ON B.DataLen BETWEEN A.DataLen - 5 AND A.DataLen+5

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow