Confronto dei nomi

https://stackoverflow.com/questions/68408

09-06-2019
|

Domanda

Esiste un algoritmo "semplice" per determinare la probabilità che 2 nomi rappresentino la stessa persona?Non sto chiedendo qualcosa del livello che potrebbe utilizzare il dipartimento Custom.Solo un semplice algoritmo che mi direbbe se 'James T.Clark' è molto probabilmente lo stesso nome di 'J.Thomas Clark" o "James Clerk".

Se ci fosse un algoritmo in C# sarebbe fantastico, ma posso tradurre da qualsiasi lingua.

Soluzione

Ho riscontrato un problema simile e ho provato prima a utilizzare la distanza Levenstein, ma non ha funzionato bene per me.Ho ideato un algoritmo che fornisce un valore di "somiglianza" tra due stringhe (un valore più alto significa stringhe più simili, "1" per stringhe identiche).Questo valore non è molto significativo di per sé (se non "1", sempre 0,5 o meno), ma funziona abbastanza bene quando si utilizza la matrice ungherese per trovare coppie corrispondenti da due elenchi di stringhe.

Utilizzare in questo modo:

PartialStringComparer cmp = new PartialStringComparer();
tbResult.Text = cmp.Compare(textBox1.Text, textBox2.Text).ToString();

Il codice dietro:

public class SubstringRange {
    string masterString;

    public string MasterString {
        get { return masterString; }
        set { masterString = value; }
    }
    int start;

    public int Start {
        get { return start; }
        set { start = value; }
    }
    int end;

    public int End {
        get { return end; }
        set { end = value; }
    }
    public int Length {
        get { return End - Start; }
        set { End = Start + value;}
    }

    public bool IsValid {
        get { return MasterString.Length >= End && End >= Start && Start >= 0; }
    }

    public string Contents {
        get {
            if(IsValid) {
                return MasterString.Substring(Start, Length);
            } else {
                return "";
            }
        }
    }
    public bool OverlapsRange(SubstringRange range) {
        return !(End < range.Start || Start > range.End);
    }
    public bool ContainsRange(SubstringRange range) {
        return range.Start >= Start && range.End <= End;
    }
    public bool ExpandTo(string newContents) {
        if(MasterString.Substring(Start).StartsWith(newContents, StringComparison.InvariantCultureIgnoreCase) && newContents.Length > Length) {
            Length = newContents.Length;
            return true;
        } else {
            return false;
        }
    }
}

public class SubstringRangeList: List<SubstringRange> {
    string masterString;

    public string MasterString {
        get { return masterString; }
        set { masterString = value; }
    }

    public SubstringRangeList(string masterString) {
        this.MasterString = masterString;
    }

    public SubstringRange FindString(string s){
        foreach(SubstringRange r in this){
            if(r.Contents.Equals(s, StringComparison.InvariantCultureIgnoreCase))
                return r;
        }
        return null;
    }

    public SubstringRange FindSubstring(string s){
        foreach(SubstringRange r in this){
            if(r.Contents.StartsWith(s, StringComparison.InvariantCultureIgnoreCase))
                return r;
        }
        return null;
    }

    public bool ContainsRange(SubstringRange range) {
        foreach(SubstringRange r in this) {
            if(r.ContainsRange(range))
                return true;
        }
        return false;
    }

    public bool AddSubstring(string substring) {
        bool result = false;
        foreach(SubstringRange r in this) {
            if(r.ExpandTo(substring)) {
                result = true;
            }
        }
        if(FindSubstring(substring) == null) {
            bool patternfound = true;
            int start = 0;
            while(patternfound){
                patternfound = false;
                start = MasterString.IndexOf(substring, start, StringComparison.InvariantCultureIgnoreCase);
                patternfound = start != -1;
                if(patternfound) {
                    SubstringRange r = new SubstringRange();
                    r.MasterString = this.MasterString;
                    r.Start = start++;
                    r.Length = substring.Length;
                    if(!ContainsRange(r)) {
                        this.Add(r);
                        result = true;
                    }
                }
            }
        }
        return result;
    }

    private static bool SubstringRangeMoreThanOneChar(SubstringRange range) {
        return range.Length > 1;
    }

    public float Weight {
        get {
            if(MasterString.Length == 0 || Count == 0)
                return 0;
            float numerator = 0;
            int denominator = 0;
            foreach(SubstringRange r in this.FindAll(SubstringRangeMoreThanOneChar)) {
                numerator += r.Length;
                denominator++;
            }
            if(denominator == 0)
                return 0;
            return numerator / denominator / MasterString.Length;
        }
    }

    public void RemoveOverlappingRanges() {
        SubstringRangeList l = new SubstringRangeList(this.MasterString);
        l.AddRange(this);//create a copy of this list
        foreach(SubstringRange r in l) {
            if(this.Contains(r) && this.ContainsRange(r)) {
                Remove(r);//try to remove the range
                if(!ContainsRange(r)) {//see if the list still contains "superset" of this range
                    Add(r);//if not, add it back
                }
            }
        }
    }

    public void AddStringToCompare(string s) {
        for(int start = 0; start < s.Length; start++) {
            for(int len = 1; start + len <= s.Length; len++) {
                string part = s.Substring(start, len);
                if(!AddSubstring(part))
                    break;
            }
        }
        RemoveOverlappingRanges();
    }
}

public class PartialStringComparer {
    public float Compare(string s1, string s2) {
        SubstringRangeList srl1 = new SubstringRangeList(s1);
        srl1.AddStringToCompare(s2);
        SubstringRangeList srl2 = new SubstringRangeList(s2);
        srl2.AddStringToCompare(s1);
        return (srl1.Weight + srl2.Weight) / 2;
    }
}

La distanza uno di Levenstein è molto più semplice (adattato da http://www.merriampark.com/ld.htm):

public class Distance {
    /// <summary>
    /// Compute Levenshtein distance
    /// </summary>
    /// <param name="s">String 1</param>
    /// <param name="t">String 2</param>
    /// <returns>Distance between the two strings.
    /// The larger the number, the bigger the difference.
    /// </returns>
    public static int LD(string s, string t) {
        int n = s.Length; //length of s
        int m = t.Length; //length of t
        int[,] d = new int[n + 1, m + 1]; // matrix
        int cost; // cost
        // Step 1
        if(n == 0) return m;
        if(m == 0) return n;
        // Step 2
        for(int i = 0; i <= n; d[i, 0] = i++) ;
        for(int j = 0; j <= m; d[0, j] = j++) ;
        // Step 3
        for(int i = 1; i <= n; i++) {
            //Step 4
            for(int j = 1; j <= m; j++) {
                // Step 5
                cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
                // Step 6
                d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

Altri suggerimenti

Sembra che tu stia cercando algoritmi basati sulla fonetica, come soundex, NYSIIS, O doppio metafono.Il primo in realtà È ciò che utilizzano diversi dipartimenti governativi ed è banale da implementare (con molte implementazioni prontamente disponibile).La seconda è una versione leggermente più complicata e più precisa della prima.Quest'ultimo funziona principalmente con alcuni nomi e alfabeti non inglesi.

Distanza di Levenshtein è una definizione di distanza tra due stringhe arbitrarie.Ti dà una distanza pari a 0 tra stringhe identiche e diversa da zero tra stringhe diverse, il che potrebbe essere utile anche se decidi di creare un algoritmo personalizzato.

Levenstein è vicino, anche se forse non è esattamente quello che desideri.

Dubito che ci sia, considerando anche il Il Dipartimento delle Dogane non sembra avere una risposta soddisfacente...

Se esiste una soluzione a questo problema, dubito seriamente che faccia parte del core C#.A prima vista, richiederebbe un database delle frequenze del nome, del secondo nome e del cognome, nonché un conto per le iniziali, come nel tuo esempio.Si tratta di una logica abbastanza complessa che si basa su un database di informazioni.

Secondo la distanza di Levenshtein, che lingua vuoi?Sono riuscito a trovare abbastanza facilmente un'implementazione in C# su codeproject.

In un'applicazione su cui ho lavorato, il campo Cognome era considerato affidabile.Quindi vengono presentati all'utente tutti i record con lo stesso cognome.L'utente può ordinare in base agli altri campi per cercare nomi simili.Questa soluzione era sufficientemente valida da ridurre notevolmente il problema degli utenti che creavano record duplicati.

Fondamentalmente sembra che la questione richiederà un giudizio umano.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow