Ändern Levenshtein Entfernung Algorithmus zur Berechnung nicht alle Entfernungen

https://stackoverflow.com/questions/3866249

28-09-2019
|

Frage

Ich arbeite an einer Fuzzy-Suche und Umsetzung im Rahmen der Umsetzung sind wir Apache StringUtils.getLevenshteinDistance verwenden. Im Moment werden wir für eine bestimmte maxmimum durchschnittliche Antwortzeit für unsere Fuzzy-Suche. Nach verschiedenen Erweiterungen und mit etwas Profilierung, dem Ort, wo die meiste Zeit verbracht wird, um die Levenshtein-Distanz zu berechnen. Es nimmt auf der Suche rund 80-90% der Gesamtzeit bis Strings drei Buchstaben oder mehr.

Nun, ich weiß, gibt es einige Einschränkungen, was hier getan werden kann, aber ich habe auf vorherigen SO Fragen und auf dem Wikipedia-Link für LD gelesen, dass wenn man den Schwellenwert auf einen eingestellten maximalen Abstand bereit Grenze ist, das könnte Hilfe zügeln die Zeit auf dem Algorithmus ausgegeben, aber ich bin mir nicht sicher, wie dies genau das zu tun.

Wenn wir nur in die interessiert sind, Abstand, wenn sie kleiner ist als ein Schwelle k, dann genügt es, einen diagonalen Streifen mit einer Breite berechnen 2k + 1 in der Matrix. Auf diese Weise die Algorithmus kann in O (kl) Zeit ausgeführt werden, wobei l die Länge des kürzesten String [3].

Im Folgenden finden Sie den Original-LH-Code von StringUtils zu sehen. Danach ist meine Modifikation. Ich versuche, im Grunde aus dem i, j Diagonale (so, in meinem Beispiel zwei Diagonalen oberhalb und unterhalb der i, j diagonal) die Abstände eines Satzes Länge zu berechnen. Dies kann jedoch nicht richtig sein, wie ich es getan habe. Zum Beispiel auf dem höchsten Diagonale, geht es immer direkt über den Zellenwert zu wählen, die 0 sein wird, wenn jemand mir zeigen konnte, wie diese funktionsfähig zu machen, wie ich beschrieben habe, oder einige allgemeinen Ratschläge, wie man es machen, so wäre es sehr zu schätzen.

public static int getLevenshteinDistance(String s, String t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n+1]; //'previous' cost array, horizontally
        int d[] = new int[n+1]; // cost array, horizontally
        int _d[]; //placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i<=n; i++) {
            p[i] = i;
        }

        for (j = 1; j<=m; j++) {
            t_j = t.charAt(j-1);
            d[0] = j;

            for (i=1; i<=n; i++) {
                cost = s.charAt(i-1)==t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now 
        // actually has the most recent cost counts
        return p[n];
    }

Meine Änderungen (nur auf den für Schleifen):

  for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;

        int k = Math.max(j-2, 1);
        for (i = k; i <= Math.min(j+2, n); i++) {
            cost = s.charAt(i-1)==t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

Lösung

Das Problem mit dem Fenster der Umsetzung wird mit dem Wert auf die linke Seite des ersten Eintrags zu tun und vor dem letzten Eintrag in jeder Zeile.

Eine Möglichkeit ist, um die Werte beginnen Sie zunächst mit 1 füllen anstelle von 0, dann ignorieren Sie alle 0s, dass Sie stoßen. Sie werden 1 von Ihrer endgültigen Antwort zu subtrahieren haben.

Eine andere Möglichkeit ist es, die Einträge links von ersten und oben zuletzt mit hohen Werten zu füllen, so dass die Mindestkontrolle wird sie nie wählen. Das ist die Art, wie ich gewählt habe, als ich es den anderen Tag zu implementieren hatte:

public static int levenshtein(String s, String t, int threshold) {
    int slen = s.length();
    int tlen = t.length();

    // swap so the smaller string is t; this reduces the memory usage
    // of our buffers
    if(tlen > slen) {
        String stmp = s;
        s = t;
        t = stmp;
        int itmp = slen;
        slen = tlen;
        tlen = itmp;
    }

    // p is the previous and d is the current distance array; dtmp is used in swaps
    int[] p = new int[tlen + 1];
    int[] d = new int[tlen + 1];
    int[] dtmp;

    // the values necessary for our threshold are written; the ones after
    // must be filled with large integers since the tailing member of the threshold 
    // window in the bottom array will run min across them
    int n = 0;
    for(; n < Math.min(p.length, threshold + 1); ++n)
        p[n] = n;
    Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // this is the core of the Levenshtein edit distance algorithm
    // instead of actually building the matrix, two arrays are swapped back and forth
    // the threshold limits the amount of entries that need to be computed if we're 
    // looking for a match within a set distance
    for(int row = 1; row < s.length()+1; ++row) {
        char schar = s.charAt(row-1);
        d[0] = row;

        // set up our threshold window
        int min = Math.max(1, row - threshold);
        int max = Math.min(d.length, row + threshold + 1);

        // since we're reusing arrays, we need to be sure to wipe the value left of the
        // starting index; we don't have to worry about the value above the ending index
        // as the arrays were initially filled with large integers and we progress to the right
        if(min > 1)
            d[min-1] = Integer.MAX_VALUE;

        for(int col = min; col < max; ++col) {
            if(schar == t.charAt(col-1))
                d[col] = p[col-1];
            else 
                // min of: diagonal, left, up
                d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
        }
        // swap our arrays
        dtmp = p;
        p = d;
        d = dtmp;
    }

        if(p[tlen] == Integer.MAX_VALUE)
            return -1;
    return p[tlen];
}

Andere Tipps

Ich habe über Levenshtein Automaten geschrieben, die eine Art und Weise sind diese Art von Kontrolle in O (n) Zeit zu tun, bevor, hier . Die Source-Code-Beispiele sind in Python, aber die Erklärungen sollten hilfreich sein, und die referenzierten Papiere bieten weitere Informationen.

habe ich die Original-Code und stellt diese kurz vor dem Ende des j for-Schleife:

    if (p[n] > s.length() + 5)
        break;

Die 5 ist willkürlich, aber für unsere Zwecke, wenn die Abstände der Abfragelänge plus fünf (oder was auch immer Zahl wir sich entscheiden), ist es nicht wirklich wichtig, was zurückgegeben wird, weil wir das Spiel so einfach es doch sehr unterschiedlich betrachten . Es schneidet sich auf die Dinge ein wenig. Trotzdem ziemlich sicher, dass dies nicht die Idee, dass die Wiki Aussage über spricht, wenn jemand das besser versteht.

Laut "Gusfield, Dan (1997) Algorithmen auf Strings, Bäumen und Sequenzen. Informatik und Computational Biology". (Seite 264) Sie sollten Nullen ignorieren

Hier jemand eine sehr ähnliche Frage beantwortet:

Cite:
Ich habe es einige Male getan. So wie ich es tun, ist mit einer rekursiven Tiefen ersten Baum-Spaziergang des Spiels Baum möglicher Änderungen. Es ist ein Budget k von Änderungen, dass ich den Baum zu beschneiden verwenden. Mit dieser Routine in der Hand, zuerst betreiben ich es mit k = 0, dann 1 k =, dann k = 2 bis ich entweder einen Schlag bekommen oder ich will nicht höher gehen.

char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
  /* if the budget is exhausted, prune the search */
  if (k < 0) return false;
  /* if at end of both strings we have a match */ 
  if (ia == na && ib == nb) return true;
  /* if the first characters match, continue walking with no reduction in budget */
  if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
  /* if the first characters don't match, assume there is a 1-character replacement */
  if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
  /* try assuming there is an extra character in a */
  if (ia < na && walk(ia+1, ib, k-1)) return true;
  /* try assuming there is an extra character in b */
  if (ib < nb && walk(ia, ib+1, k-1)) return true;
  /* if none of those worked, I give up */
  return false;
}

nur der Hauptteil, mehr Code in dem ursprünglichen

Apache Commons Lang 3.4 hat diese Implementierung:

/**
 * <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
 * threshold.</p>
 *
 * <p>This is the number of changes needed to change one String into
 * another, where each change is a single character modification (deletion,
 * insertion or substitution).</p>
 *
 * <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
 * and Chas Emerick's implementation of the Levenshtein distance algorithm from
 * <a href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.htm</a></p>
 *
 * <pre>
 * StringUtils.getLevenshteinDistance(null, *, *)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance(*, null, *)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance(*, *, -1)               = IllegalArgumentException
 * StringUtils.getLevenshteinDistance("","", 0)               = 0
 * StringUtils.getLevenshteinDistance("aaapppp", "", 8)       = 7
 * StringUtils.getLevenshteinDistance("aaapppp", "", 7)       = 7
 * StringUtils.getLevenshteinDistance("aaapppp", "", 6))      = -1
 * StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
 * StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
 * StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
 * StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
 * </pre>
 *
 * @param s  the first String, must not be null
 * @param t  the second String, must not be null
 * @param threshold the target threshold, must not be negative
 * @return result distance, or {@code -1} if the distance would be greater than the threshold
 * @throws IllegalArgumentException if either String input {@code null} or negative threshold
 */
public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }
    if (threshold < 0) {
        throw new IllegalArgumentException("Threshold must not be negative");
    }

    /*
    This implementation only computes the distance if it's less than or equal to the
    threshold value, returning -1 if it's greater.  The advantage is performance: unbounded
    distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
    computing a diagonal stripe of width 2k + 1 of the cost table.
    It is also possible to use this to compute the unbounded Levenshtein distance by starting
    the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
    d is the distance.

    One subtlety comes from needing to ignore entries on the border of our stripe
    eg.
    p[] = |#|#|#|*
    d[] =  *|#|#|#|
    We must ignore the entry to the left of the leftmost member
    We must ignore the entry above the rightmost member

    Another subtlety comes from our stripe running off the matrix if the strings aren't
    of the same size.  Since string s is always swapped to be the shorter of the two,
    the stripe will always run off to the upper right instead of the lower left of the matrix.

    As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
    In this case we're going to walk a stripe of length 3.  The matrix would look like so:

       1 2 3 4 5
    1 |#|#| | | |
    2 |#|#|#| | |
    3 | |#|#|#| |
    4 | | |#|#|#|
    5 | | | |#|#|
    6 | | | | |#|
    7 | | | | | |

    Note how the stripe leads off the table as there is no possible way to turn a string of length 5
    into one of length 7 in edit distance of 1.

    Additionally, this implementation decreases memory usage by using two
    single-dimensional arrays and swapping them back and forth instead of allocating
    an entire n by m matrix.  This requires a few minor changes, such as immediately returning
    when it's detected that the stripe has run off the matrix and initially filling the arrays with
    large values so that entries we don't compute are ignored.

    See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
     */

    int n = s.length(); // length of s
    int m = t.length(); // length of t

    // if one string is empty, the edit distance is necessarily the length of the other
    if (n == 0) {
        return m <= threshold ? m : -1;
    } else if (m == 0) {
        return n <= threshold ? n : -1;
    }

    if (n > m) {
        // swap the two strings to consume less memory
        final CharSequence tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    int p[] = new int[n + 1]; // 'previous' cost array, horizontally
    int d[] = new int[n + 1]; // cost array, horizontally
    int _d[]; // placeholder to assist in swapping p and d

    // fill in starting table values
    final int boundary = Math.min(n, threshold) + 1;
    for (int i = 0; i < boundary; i++) {
        p[i] = i;
    }
    // these fills ensure that the value above the rightmost entry of our
    // stripe will be ignored in following loop iterations
    Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // iterates through t
    for (int j = 1; j <= m; j++) {
        final char t_j = t.charAt(j - 1); // jth character of t
        d[0] = j;

        // compute stripe indices, constrain to array size
        final int min = Math.max(1, j - threshold);
        final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);

        // the stripe may lead off of the table if s and t are of different sizes
        if (min > max) {
            return -1;
        }

        // ignore entry left of leftmost
        if (min > 1) {
            d[min - 1] = Integer.MAX_VALUE;
        }

        // iterates through [min, max] in s
        for (int i = min; i <= max; i++) {
            if (s.charAt(i - 1) == t_j) {
                // diagonally left and up
                d[i] = p[i - 1];
            } else {
                // 1 + minimum of cell to the left, to the top, diagonally left and up
                d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
            }
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

    // if p[n] is greater than the threshold, there's no guarantee on it being the correct
    // distance
    if (p[n] <= threshold) {
        return p[n];
    }
    return -1;
}

Lizenziert unter: CC-BY-SA mit Zuschreibung

Nicht verbunden mit StackOverflow