Ein besserer Ähnlichkeitsranking-Algorithmus für Zeichenfolgen variabler Länge

https://stackoverflow.com/questions/653157

19-08-2019
|

Frage

Ich suche nach einem String-Ähnlichkeitsalgorithmus, der bei Strings variabler Länge bessere Ergebnisse liefert als die normalerweise vorgeschlagenen (Levenshtein-Abstand, Soundex usw.).

Zum Beispiel,

Gegebene Zeichenfolge A:"Robert",

Dann String B:„Amy Robertson“

würde besser passen als

Zeichenfolge C:„Richard“

Außerdem sollte dieser Algorithmus vorzugsweise sprachunabhängig sein (funktioniert auch in anderen Sprachen als Englisch).

Lösung

Simon White von Catalysoft schrieb einen Artikel über einen sehr cleveren Algorithmus, der benachbarten Zeichenpaaren vergleicht, die wirklich gut für meine Zwecke funktioniert:

http://www.catalysoft.com/articles/StrikeAMatch.html

Simon hat eine Java-Version des Algorithmus und unter mir eine PL / Ruby-Version davon (aus der Ebene Ruby-Version in dem zugehörigen Forum Eintrag Kommentar von Mark Wong-vanHaren gemacht) so geschrieben, dass ich es in meinem verwenden kann PostgreSQL-Abfragen:

CREATE FUNCTION string_similarity(str1 varchar, str2 varchar)
RETURNS float8 AS '

str1.downcase! 
pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
  |pair| pair.include? " "}
str2.downcase! 
pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
  |pair| pair.include? " "}
union = pairs1.size + pairs2.size 
intersection = 0 
pairs1.each do |p1| 
  0.upto(pairs2.size-1) do |i| 
    if p1 == pairs2[i] 
      intersection += 1 
      pairs2.slice!(i) 
      break 
    end 
  end 
end 
(2.0 * intersection) / union

' LANGUAGE 'plruby';

Funktioniert wie ein Zauber!

Andere Tipps

Marzagão Antwort ist groß. Ich baute es zu C #, so dachte ich, ich es hier posten würde:

Pastebin Link-

/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
    /// <summary>
    /// Compares the two strings based on letter pair matches
    /// </summary>
    /// <param name="str1"></param>
    /// <param name="str2"></param>
    /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
    public double CompareStrings(string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }
}

Hier ist eine andere Version von Marzagão der Antwort, diesmal in Python geschrieben:

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in list(range(len(s) - 1))]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

if __name__ == "__main__":
    """
    Run a test using the example taken from:
    http://www.catalysoft.com/articles/StrikeAMatch.html
    """
    w1 = 'Healed'
    words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']

    for w2 in words:
        print('Healed --- ' + w2)
        print(string_similarity(w1, w2))
        print()

Hier ist meine PHP Implementierung der vorgeschlagenen StrikeAMatch Algorithmus von Simon White. die Vorteile (wie er sagt, in der Verbindung) sind:

Ein getreues Spiegelbild der lexikalischen Ähnlichkeit - Saiten mit kleinen Unterschieden sollen als ähnlich erkannt werden. Insbesondere eine deutliche Teilzeichenfolge Überlappung zu einem hohen Grad der Ähnlichkeit zwischen den Saiten zeigen sollte.
Eine Robustheit gegenüber Änderungen der Wortfolge - zwei Saiten, die die gleichen Worte enthalten, aber in einer anderen Reihenfolge, sollte als ähnlich erkannt werden. Auf der anderen Seite, wenn eine Zeichenfolge nur ein zufälliges Anagramm der enthaltenen Zeichen in dem anderes ist, dann sollte es (in der Regel) als unähnlich erkannt werden.
Sprache Independence -. Der Algorithmus nicht nur in Englisch arbeiten, aber in vielen verschiedenen Sprachen

<?php
/**
 * LetterPairSimilarity algorithm implementation in PHP
 * @author Igal Alkon
 * @link http://www.catalysoft.com/articles/StrikeAMatch.html
 */
class LetterPairSimilarity
{
    /**
     * @param $str
     * @return mixed
     */
    private function wordLetterPairs($str)
    {
        $allPairs = array();

        // Tokenize the string and put the tokens/words into an array

        $words = explode(' ', $str);

        // For each word
        for ($w = 0; $w < count($words); $w++)
        {
            // Find the pairs of characters
            $pairsInWord = $this->letterPairs($words[$w]);

            for ($p = 0; $p < count($pairsInWord); $p++)
            {
                $allPairs[] = $pairsInWord[$p];
            }
        }

        return $allPairs;
    }

    /**
     * @param $str
     * @return array
     */
    private function letterPairs($str)
    {
        $numPairs = mb_strlen($str)-1;
        $pairs = array();

        for ($i = 0; $i < $numPairs; $i++)
        {
            $pairs[$i] = mb_substr($str,$i,2);
        }

        return $pairs;
    }

    /**
     * @param $str1
     * @param $str2
     * @return float
     */
    public function compareStrings($str1, $str2)
    {
        $pairs1 = $this->wordLetterPairs(strtoupper($str1));
        $pairs2 = $this->wordLetterPairs(strtoupper($str2));

        $intersection = 0;

        $union = count($pairs1) + count($pairs2);

        for ($i=0; $i < count($pairs1); $i++)
        {
            $pair1 = $pairs1[$i];

            $pairs2 = array_values($pairs2);
            for($j = 0; $j < count($pairs2); $j++)
            {
                $pair2 = $pairs2[$j];
                if ($pair1 === $pair2)
                {
                    $intersection++;
                    unset($pairs2[$j]);
                    break;
                }
            }
        }

        return (2.0*$intersection)/$union;
    }
}

Eine kürzere Version von John Rutledges Antwort:

def get_bigrams(string):
    '''
    Takes a string and returns a list of bigrams
    '''
    s = string.lower()
    return {s[i:i+2] for i in xrange(len(s) - 1)}

def string_similarity(str1, str2):
    '''
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    '''
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    return (2.0 * len(pairs1 & pairs2)) / (len(pairs1) + len(pairs2))

Diese Diskussion war wirklich hilfreich, danke. Ich konvertiert den Algorithmus zu VBA für Excel und schrieb die andere zum Vergleichen einer Zeichenfolge in einen Bereich / Array von Strings ein paar Versionen einer Tabellenfunktion, eine für einen einfachen Vergleich von zwei Strings. Die strSimLookup Version gibt entweder die letzte beste Übereinstimmung als String, Array-Index oder Ahnlichkeitsmetrik.

Diese Implementierung erzeugt die gleichen im Amazonas Beispiel aufgeführt Ergebnisse auf Simon White Webseite mit ein paar kleinen Ausnahmen auf Low-Scoring-Spielen; nicht sicher, wo der Unterschied in kriecht, Split-Funktion des VBA sein könnte, aber ich habe nicht untersucht, da es für meine Zwecke in Ordnung funktioniert.

'Implements functions to rate how similar two strings are on
'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar)
'Source:   http://www.catalysoft.com/articles/StrikeAMatch.html
'Author: Bob Chatham, bob.chatham at gmail.com
'9/12/2010

Option Explicit

Public Function stringSimilarity(str1 As String, str2 As String) As Variant
'Simple version of the algorithm that computes the similiarity metric
'between two strings.
'NOTE: This verision is not efficient to use if you're comparing one string
'with a range of other values as it will needlessly calculate the pairs for the
'first string over an over again; use the array-optimized version for this case.

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection

    Set sPairs1 = New Collection
    Set sPairs2 = New Collection

    WordLetterPairs str1, sPairs1
    WordLetterPairs str2, sPairs2

    stringSimilarity = SimilarityMetric(sPairs1, sPairs2)

    Set sPairs1 = Nothing
    Set sPairs2 = Nothing

End Function

Public Function strSimA(str1 As Variant, rRng As Range) As Variant
'Return an array of string similarity indexes for str1 vs every string in input range rRng
    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim arrOut As Variant
    Dim l As Long, j As Long

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    l = rRng.Count
    ReDim arrOut(1 To l)
    For j = 1 To l
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(j)), sPairs2
        arrOut(j) = SimilarityMetric(sPairs1, sPairs2)
        Set sPairs2 = Nothing
    Next j

    strSimA = Application.Transpose(arrOut)

End Function

Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant
'Return either the best match or the index of the best match
'depending on returnTYype parameter) between str1 and strings in rRng)
' returnType = 0 or omitted: returns the best matching string
' returnType = 1           : returns the index of the best matching string
' returnType = 2           : returns the similarity metric

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim metric, bestMetric As Double
    Dim i, iBest As Long
    Const RETURN_STRING As Integer = 0
    Const RETURN_INDEX As Integer = 1
    Const RETURN_METRIC As Integer = 2

    If IsMissing(returnType) Then returnType = RETURN_STRING

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    bestMetric = -1
    iBest = -1

    For i = 1 To rRng.Count
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(i)), sPairs2
        metric = SimilarityMetric(sPairs1, sPairs2)
        If metric > bestMetric Then
            bestMetric = metric
            iBest = i
        End If
        Set sPairs2 = Nothing
    Next i

    If iBest = -1 Then
        strSimLookup = CVErr(xlErrValue)
        Exit Function
    End If

    Select Case returnType
    Case RETURN_STRING
        strSimLookup = CStr(rRng(iBest))
    Case RETURN_INDEX
        strSimLookup = iBest
    Case Else
        strSimLookup = bestMetric
    End Select

End Function

Public Function strSim(str1 As String, str2 As String) As Variant
    Dim ilen, iLen1, ilen2 As Integer

    iLen1 = Len(str1)
    ilen2 = Len(str2)

    If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1

    strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen))

End Function

Sub WordLetterPairs(str As String, pairColl As Collection)
'Tokenize str into words, then add all letter pairs to pairColl

    Dim Words() As String
    Dim word, nPairs, pair As Integer

    Words = Split(str)

    If UBound(Words) < 0 Then
        Set pairColl = Nothing
        Exit Sub
    End If

    For word = 0 To UBound(Words)
        nPairs = Len(Words(word)) - 1
        If nPairs > 0 Then
            For pair = 1 To nPairs
                pairColl.Add Mid(Words(word), pair, 2)
            Next pair
        End If
    Next word

End Sub

Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant
'Helper function to calculate similarity metric given two collections of letter pairs.
'This function is designed to allow the pair collections to be set up separately as needed.
'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection
'if this is not the desired behavior.
'Also assumes that collections will be deallocated somewhere else

    Dim Intersect As Double
    Dim Union As Double
    Dim i, j As Long

    If sPairs1.Count = 0 Or sPairs2.Count = 0 Then
        SimilarityMetric = CVErr(xlErrNA)
        Exit Function
    End If

    Union = sPairs1.Count + sPairs2.Count
    Intersect = 0

    For i = 1 To sPairs1.Count
        For j = 1 To sPairs2.Count
            If StrComp(sPairs1(i), sPairs2(j)) = 0 Then
                Intersect = Intersect + 1
                sPairs2.Remove j
                Exit For
            End If
        Next j
    Next i

    SimilarityMetric = (2 * Intersect) / Union

End Function

Es tut mir leid, war die Antwort nicht vom Autor erfunden. Dies ist ein bekannter Algorithmus, der erste Geschenk von Digital Equipment Corporation wurde und wird oft als Schindeln bezeichnet.

http://www.hpl.hp.com/techreports /Compaq-DEC/SRC-TN-1997-015.pdf

Ich übersetzte Simon White-Algorithmus auf PL / pgSQL. Das ist mein Beitrag.

<!-- language: lang-sql -->

create or replace function spt1.letterpairs(in p_str varchar) 
returns varchar  as 
$$
declare

    v_numpairs integer := length(p_str)-1;
    v_pairs varchar[];

begin

    for i in 1 .. v_numpairs loop
        v_pairs[i] := substr(p_str, i, 2);
    end loop;

    return v_pairs;

end;
$$ language 'plpgsql';

--===================================================================

create or replace function spt1.wordletterpairs(in p_str varchar) 
returns varchar as
$$
declare
    v_allpairs varchar[];
    v_words varchar[];
    v_pairsinword varchar[];
begin
    v_words := regexp_split_to_array(p_str, '[[:space:]]');

    for i in 1 .. array_length(v_words, 1) loop
        v_pairsinword := spt1.letterpairs(v_words[i]);

        if v_pairsinword is not null then
            for j in 1 .. array_length(v_pairsinword, 1) loop
                v_allpairs := v_allpairs || v_pairsinword[j];
            end loop;
        end if;

    end loop;


    return v_allpairs;
end;
$$ language 'plpgsql';

--===================================================================

create or replace function spt1.arrayintersect(ANYARRAY, ANYARRAY)
returns anyarray as 
$$
    select array(select unnest($1) intersect select unnest($2))
$$ language 'sql';

--===================================================================

create or replace function spt1.comparestrings(in p_str1 varchar, in p_str2 varchar)
returns float as
$$
declare
    v_pairs1 varchar[];
    v_pairs2 varchar[];
    v_intersection integer;
    v_union integer;
begin
    v_pairs1 := wordletterpairs(upper(p_str1));
    v_pairs2 := wordletterpairs(upper(p_str2));
    v_union := array_length(v_pairs1, 1) + array_length(v_pairs2, 1); 

    v_intersection := array_length(arrayintersect(v_pairs1, v_pairs2), 1);

    return (2.0 * v_intersection / v_union);
end;
$$ language 'plpgsql';

Eine schnellere PHP Version des Algorithmus:

/**
 *
 * @param $str
 * @return mixed
 */
private static function wordLetterPairs ($str)
{
    $allPairs = array();

    // Tokenize the string and put the tokens/words into an array

    $words = explode(' ', $str);

    // For each word
    for ($w = 0; $w < count($words); $w ++) {
        // Find the pairs of characters
        $pairsInWord = self::letterPairs($words[$w]);

        for ($p = 0; $p < count($pairsInWord); $p ++) {
            $allPairs[$pairsInWord[$p]] = $pairsInWord[$p];
        }
    }

    return array_values($allPairs);
}

/**
 *
 * @param $str
 * @return array
 */
private static function letterPairs ($str)
{
    $numPairs = mb_strlen($str) - 1;
    $pairs = array();

    for ($i = 0; $i < $numPairs; $i ++) {
        $pairs[$i] = mb_substr($str, $i, 2);
    }

    return $pairs;
}

/**
 *
 * @param $str1
 * @param $str2
 * @return float
 */
public static function compareStrings ($str1, $str2)
{
    $pairs1 = self::wordLetterPairs(mb_strtolower($str1));
    $pairs2 = self::wordLetterPairs(mb_strtolower($str2));


    $union = count($pairs1) + count($pairs2);

    $intersection = count(array_intersect($pairs1, $pairs2));

    return (2.0 * $intersection) / $union;
}

Für die Daten, die ich hatte (ca. 2300 Vergleiche) Ich hatte eine Laufzeit von 0.58sec mit Igal Alkon Lösung im Vergleich zu 0.35sec mit mir.

Eine Version in schöner Scala:

  def pairDistance(s1: String, s2: String): Double = {

    def strToPairs(s: String, acc: List[String]): List[String] = {
      if (s.size < 2) acc
      else strToPairs(s.drop(1),
        if (s.take(2).contains(" ")) acc else acc ::: List(s.take(2)))
    }

    val lst1 = strToPairs(s1.toUpperCase, List())
    val lst2 = strToPairs(s2.toUpperCase, List())

    (2.0 * lst2.intersect(lst1).size) / (lst1.size + lst2.size)

  }

String Ähnlichkeit Metrics einen Überblick über viele verschiedene Metriken enthält verwendet in String-Vergleich ( Wikipedia hat einen Überblick als auch). Ein großer Teil dieser Metriken wird in einer Bibliothek implementiert simmetrics .

Ein weiteres Beispiel für metrische, nicht in der gegebenen Übersicht enthalten ist zum Beispiel: Pressweg (Versuch, den Kolmogorov Komplexität zu approximieren), die verwendet werden können, für etwas längere Texte als das von Ihnen vorgestellt.

Sie könnten auch erwägen Natural Language Processing zu einem viel breiteren Thema suchen. Diese R-Pakete können Sie einige Ideen schnell (oder zumindest geben loszulegen ).

Und man zuletzt - die anderen Fragen zu diesem Thema sucht, an SO, gibt es durchaus ein paar verwandt diejenigen

Hier ist die R-Version:

get_bigrams <- function(str)
{
  lstr = tolower(str)
  bigramlst = list()
  for(i in 1:(nchar(str)-1))
  {
    bigramlst[[i]] = substr(str, i, i+1)
  }
  return(bigramlst)
}

str_similarity <- function(str1, str2)
{
   pairs1 = get_bigrams(str1)
   pairs2 = get_bigrams(str2)
   unionlen  = length(pairs1) + length(pairs2)
   hit_count = 0
   for(x in 1:length(pairs1)){
        for(y in 1:length(pairs2)){
            if (pairs1[[x]] == pairs2[[y]])
                hit_count = hit_count + 1
        }
   }
   return ((2.0 * hit_count) / unionlen)
}

Marzagão Antwort in C99, inspiriert von diese Algorithmen

double dice_match(const char *string1, const char *string2) {

    //check fast cases
    if (((string1 != NULL) && (string1[0] == '\0')) || 
        ((string2 != NULL) && (string2[0] == '\0'))) {
        return 0;
    }
    if (string1 == string2) {
        return 1;
    }

    size_t strlen1 = strlen(string1);
    size_t strlen2 = strlen(string2);
    if (strlen1 < 2 || strlen2 < 2) {
        return 0;
    }

    size_t length1 = strlen1 - 1;
    size_t length2 = strlen2 - 1;

    double matches = 0;
    int i = 0, j = 0;

    //get bigrams and compare
    while (i < length1 && j < length2) {
        char a[3] = {string1[i], string1[i + 1], '\0'};
        char b[3] = {string2[j], string2[j + 1], '\0'};
        int cmp = strcmpi(a, b);
        if (cmp == 0) {
            matches += 2;
        }
        i++;
        j++;
    }

    return matches / (length1 + length2);
}

Einige Tests basierend auf dem Original-Artikel :

#include <stdio.h>

void article_test1() {
    char *string1 = "FRANCE";
    char *string2 = "FRENCH";
    printf("====%s====\n", __func__);
    printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);
}


void article_test2() {
    printf("====%s====\n", __func__);
    char *string = "Healed";
    char *ss[] = {"Heard", "Healthy", "Help",
                  "Herded", "Sealed", "Sold"};
    int correct[] = {44, 55, 25, 40, 80, 0};
    for (int i = 0; i < 6; ++i) {
        printf("%2.f%% == %d%%\n", dice_match(string, ss[i]) * 100, correct[i]);
    }
}

void multicase_test() {
    char *string1 = "FRaNcE";
    char *string2 = "fREnCh";
    printf("====%s====\n", __func__);
    printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);

}

void gg_test() {
    char *string1 = "GG";
    char *string2 = "GGGGG";
    printf("====%s====\n", __func__);
    printf("%2.f%% != 100%%\n", dice_match(string1, string2) * 100);
}


int main() {
    article_test1();
    article_test2();
    multicase_test();
    gg_test();

    return 0;
}

Aufbauend auf Michael La Voie ehrfürchtige C # Version, gemäß dem Wunsch es eine Erweiterungsmethode zu machen, hier ist das, was ich kam mit. Der Hauptvorteil der es auf diese Weise tun, ist, dass Sie eine generische Liste durch den Prozentsatz Spiel sortieren. Betrachten wir zum Beispiel ein String Feld mit dem Namen „City“ in Ihrem Objekt haben. Ein Nutzer sucht nach „Chester“ und Sie wollen Ergebnisse zurück, um Spiel absteigend. Zum Beispiel wollen Sie wörtliche Übereinstimmungen von Chester vor Rochester zu zeigen. Um dies zu tun, fügen Sie zwei neue Eigenschaften zu Ihrem Objekt:

    public string SearchText { get; set; }
    public double PercentMatch
    {
        get
        {
            return City.ToUpper().PercentMatchTo(this.SearchText.ToUpper());
        }
    }

Dann auf jedes Objekt, stellen Sie den Suchtext zu dem, was für den Benutzer gesucht. Dann können Sie es leicht mit so etwas wie sortieren:

    zipcodes = zipcodes.OrderByDescending(x => x.PercentMatch);

Hier ist die leichte Modifikation es eine Erweiterungsmethode zu machen:

    /// <summary>
    /// This class implements string comparison algorithm
    /// based on character pair similarity
    /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
    /// </summary>
    public static double PercentMatchTo(this string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private static List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private static  string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }

Meine JavaScript-Implementierung nimmt einen String oder ein Array von Strings und ein optionalen Boden (der Standard Etage 0,5). Wenn Sie eine Zeichenfolge übergeben, wird er zurückkehren wahr oder falsch, je nachdem, ob es sich bei dem Ähnlichkeitswert der Zeichenfolge größer oder gleich dem Boden. Wenn Sie es ein Array von Strings übergeben, wird es eine Reihe von diesen Strings, deren Ähnlichkeitswert zurückkehren größer oder gleich dem Boden, sortiert nach Punkten.

Beispiele:

'Healed'.fuzzy('Sealed');      // returns true
'Healed'.fuzzy('Help');        // returns false
'Healed'.fuzzy('Help', 0.25);  // returns true

'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy']);
// returns ["Sealed", "Healthy"]

'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy'], 0);
// returns ["Sealed", "Healthy", "Heard", "Herded", "Help", "Sold"]

Hier ist sie:

(function(){
  var default_floor = 0.5;

  function pairs(str){
    var pairs = []
      , length = str.length - 1
      , pair;
    str = str.toLowerCase();
    for(var i = 0; i < length; i++){
      pair = str.substr(i, 2);
      if(!/\s/.test(pair)){
        pairs.push(pair);
      }
    }
    return pairs;
  }

  function similarity(pairs1, pairs2){
    var union = pairs1.length + pairs2.length
      , hits = 0;

    for(var i = 0; i < pairs1.length; i++){
      for(var j = 0; j < pairs1.length; j++){
        if(pairs1[i] == pairs2[j]){
          pairs2.splice(j--, 1);
          hits++;
          break;
        }
      }
    }
    return 2*hits/union || 0;
  }

  String.prototype.fuzzy = function(strings, floor){
    var str1 = this
      , pairs1 = pairs(this);

    floor = typeof floor == 'number' ? floor : default_floor;

    if(typeof(strings) == 'string'){
      return str1.length > 1 && strings.length > 1 && similarity(pairs1, pairs(strings)) >= floor || str1.toLowerCase() == strings.toLowerCase();
    }else if(strings instanceof Array){
      var scores = {};

      strings.map(function(str2){
        scores[str2] = str1.length > 1 ? similarity(pairs1, pairs(str2)) : 1*(str1.toLowerCase() == str2.toLowerCase());
      });

      return strings.filter(function(str){
        return scores[str] >= floor;
      }).sort(function(a, b){
        return scores[b] - scores[a];
      });
    }
  };
})();

Und hier ist eine verkleinerte Version für Ihre Bequemlichkeit:

(function(){function g(a){var b=[],e=a.length-1,d;a=a.toLowerCase();for(var c=0;c<e;c++)d=a.substr(c,2),/\s/.test(d)||b.push(d);return b}function h(a,b){for(var e=a.length+b.length,d=0,c=0;c<a.length;c++)for(var f=0;f<a.length;f++)if(a[c]==b[f]){b.splice(f--,1);d++;break}return 2*d/e||0}String.prototype.fuzzy=function(a,b){var e=this,d=g(this);b="number"==typeof b?b:0.5;if("string"==typeof a)return 1<e.length&&1<a.length&&h(d,g(a))>=b||e.toLowerCase()==a.toLowerCase();if(a instanceof Array){var c={};a.map(function(a){c[a]=1<e.length?h(d,g(a)):1*(e.toLowerCase()==a.toLowerCase())});return a.filter(function(a){return c[a]>=b}).sort(function(a,b){return c[b]-c[a]})}}})();

Der Dice-Koeffizienten-Algorithmus (Simon White / Marzagão Antwort) wird in Ruby in der Umsetzung pair_distance_similar Methode im amatch gem

https://github.com/flori/amatch

Dieses Juwel enthält auch Implementierungen eine Anzahl von ungefähren Abgleich und String-Vergleichsalgorithmen: Levenshtein Editierdistanz, Sellers bearbeitet Abstand, die Hamming-Distanz, die längsten gemeinsamen Teilfolge der Länge, die längsten gemeinsamen Teilzeichenlänge, die Paar Distanzmetrik, die Jaro -Winkler Metrik.

Eine Haskell-Version-Feel Bearbeitungen frei vorzuschlagen, weil ich nicht viel Haskell getan haben.

import Data.Char
import Data.List

-- Convert a string into words, then get the pairs of words from that phrase
wordLetterPairs :: String -> [String]
wordLetterPairs s1 = concat $ map pairs $ words s1

-- Converts a String into a list of letter pairs.
pairs :: String -> [String]
pairs [] = []
pairs (x:[]) = []
pairs (x:ys) = [x, head ys]:(pairs ys)

-- Calculates the match rating for two strings
matchRating :: String -> String -> Double
matchRating s1 s2 = (numberOfMatches * 2) / totalLength
  where pairsS1 = wordLetterPairs $ map toLower s1
        pairsS2 = wordLetterPairs $ map toLower s2
        numberOfMatches = fromIntegral $ length $ pairsS1 `intersect` pairsS2
        totalLength = fromIntegral $ length pairsS1 + length pairsS2

Clojure:

(require '[clojure.set :refer [intersection]])

(defn bigrams [s]
  (->> (split s #"\s+")
       (mapcat #(partition 2 1 %))
       (set)))

(defn string-similarity [a b]
  (let [a-pairs (bigrams a)
        b-pairs (bigrams b)
        total-count (+ (count a-pairs) (count b-pairs))
        match-count (count (intersection a-pairs b-pairs))
        similarity (/ (* 2 match-count) total-count)]
    similarity))

Was ist Levenshtein-Distanz, um die Länge des ersten Strings geteilt (oder alternativ meine min / max / avg Länge beiden Strings geteilt)? Das hat für mich bisher gearbeitet haben.

Hey Jungs gab ich dies einen Versuch in Javascript, aber ich bin neu, um es, jemand wissen, schnellere Wege, es zu tun?

function get_bigrams(string) {
    // Takes a string and returns a list of bigrams
    var s = string.toLowerCase();
    var v = new Array(s.length-1);
    for (i = 0; i< v.length; i++){
        v[i] =s.slice(i,i+2);
    }
    return v;
}

function string_similarity(str1, str2){
    /*
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    */
    var pairs1 = get_bigrams(str1);
    var pairs2 = get_bigrams(str2);
    var union = pairs1.length + pairs2.length;
    var hit_count = 0;
    for (x in pairs1){
        for (y in pairs2){
            if (pairs1[x] == pairs2[y]){
                hit_count++;
            }
        }
    }
    return ((2.0 * hit_count) / union);
}


var w1 = 'Healed';
var word =['Heard','Healthy','Help','Herded','Sealed','Sold']
for (w2 in word){
    console.log('Healed --- ' + word[w2])
    console.log(string_similarity(w1,word[w2]));
}

Hier ist eine andere Version von Ähnlichkeit mit Sitz in Sørensen-Dice-Index (Marzagão Antwort), diese in C ++ geschrieben 11:

/*
 * Similarity based in Sørensen–Dice index.
 *
 * Returns the Similarity between _str1 and _str2.
 */
double similarity_sorensen_dice(const std::string& _str1, const std::string& _str2) {
    // Base case: if some string is empty.
    if (_str1.empty() || _str2.empty()) {
        return 1.0;
    }

    auto str1 = upper_string(_str1);
    auto str2 = upper_string(_str2);

    // Base case: if the strings are equals.
    if (str1 == str2) {
        return 0.0;
    }

    // Base case: if some string does not have bigrams.
    if (str1.size() < 2 || str2.size() < 2) {
        return 1.0;
    }

    // Extract bigrams from str1
    auto num_pairs1 = str1.size() - 1;
    std::unordered_set<std::string> str1_bigrams;
    str1_bigrams.reserve(num_pairs1);
    for (unsigned i = 0; i < num_pairs1; ++i) {
        str1_bigrams.insert(str1.substr(i, 2));
    }

    // Extract bigrams from str2
    auto num_pairs2 = str2.size() - 1;
    std::unordered_set<std::string> str2_bigrams;
    str2_bigrams.reserve(num_pairs2);
    for (unsigned int i = 0; i < num_pairs2; ++i) {
        str2_bigrams.insert(str2.substr(i, 2));
    }

    // Find the intersection between the two sets.
    int intersection = 0;
    if (str1_bigrams.size() < str2_bigrams.size()) {
        const auto it_e = str2_bigrams.end();
        for (const auto& bigram : str1_bigrams) {
            intersection += str2_bigrams.find(bigram) != it_e;
        }
    } else {
        const auto it_e = str1_bigrams.end();
        for (const auto& bigram : str2_bigrams) {
            intersection += str1_bigrams.find(bigram) != it_e;
        }
    }

    // Returns similarity coefficient.
    return (2.0 * intersection) / (num_pairs1 + num_pairs2);
}

Ich suchte nach einer reinen Ruby-Implementierung des in der Antwort von @marzagao angegebenen Algorithmus.Leider ist der von @marzagao angegebene Link defekt.In der Antwort von @s01ipsist wies er auf Ruby Gem hin eine Übereinstimmung wo die Implementierung nicht in reinem Ruby erfolgt.Also habe ich ein wenig gesucht und ein Juwel gefunden fuzzy_match das eine reine Ruby-Implementierung hat (obwohl dieses Juwel verwendet wird). amatch) bei Hier.Ich hoffe, das hilft jemandem wie mir.

Lizenziert unter: CC-BY-SA mit Zuschreibung

Nicht verbunden mit StackOverflow