Similitude Comparaison de chaînes en Java

https://stackoverflow.com/questions/955110

11-09-2019
|

Question

Je veux comparer plusieurs cordes à l'autre, et trouver ceux qui sont les plus semblables. Je me demandais s'il y a une bibliothèque, une méthode ou les meilleures pratiques qui me renvoie des chaînes qui sont plus semblables à d'autres chaînes. Par exemple:

"Le renard a sauté rapide" -> "Le renard a sauté"
"Le renard a sauté rapide" -> "Le renard"

Cette comparaison retournerait que le premier est plus similaire que le second.

Je suppose que j'ai besoin d'une méthode telle que:

double similarityIndex(String s1, String s2)

Y at-il une telle chose quelque part?

EDIT: Pourquoi je fais cela? Je suis en train d'écrire un script qui compare la sortie d'un fichier MS Project à la sortie de certains système existant qui gère les tâches. Parce que le système existant a une largeur de champ très limité, lorsque les valeurs sont ajoutées les descriptions sont abrégées. Je veux un moyen de trouver les entrées de MS Project semi-automatique sont similaires aux entrées sur le système que je peux obtenir les clés générées. Il présente des inconvénients, car il doit encore être vérifiée manuellement, mais cela permettrait d'économiser beaucoup de travail

La solution

Oui, il existe de nombreux algorithmes bien documentés comme:

similarité cosinus
similitude Jaccard
coefficient de Dice
similarité de correspondance
similitude Overlap
etc etc

Vous pouvez également vous pouvez le vérifier

Vérifiez aussi ces projets:

Autres conseils

La façon courante de calcul de la similarité entre deux chaînes dans un 0% à 100% mode , tel qu'il est utilisé dans de nombreuses bibliothèques, est de mesurer la quantité (en%) que vous auriez à changer la chaîne plus longue pour en faire la plus courte:

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below

Calcul de la `editDistance()`:

La fonction editDistance() devrait ci-dessus pour calculer la distance d'édition entre les deux chaînes. Il y a plusieurs implémentations à cette étape, chacun peut mieux convenir d'un scénario spécifique. Le plus courant est le Levenshtein algorithme distance et nous allons l'utiliser dans notre exemple ci-dessous (pour les chaînes très grandes, d'autres algorithmes sont susceptibles de mieux).

Voici deux options pour calculer la distance d'édition:

Vous pouvez utiliser Apache Commons texte 's la mise en œuvre de la distance Levenshtein: apply(CharSequence left, CharSequence rightt)
Mettre en œuvre dans votre propre. Ci-dessous vous trouverez un exemple d'implémentation.

Exemple de travail:

Voir la démo en ligne ici.

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

Sortie:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"

Je traduis Levenshtein algorithme distance en JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};

Vous pouvez utiliser la distance Levenshtein pour calculer la différence entre les deux chaînes. http://en.wikipedia.org/wiki/Levenshtein_distance

Il y a en effet beaucoup de mesures de similarité de chaîne là-bas:

Levenshtein distance d'édition;
distance Damerau-Levenshtein;
similitude Jaro-Winkler;
Longest Common Subsequence distance d'édition;
Q-Gram (Ukkonen);
distance n-gramme (Kondrak);
Index Jaccard;
coefficient Sorensen-dés;
similarité de Cosinus;
...

Vous pouvez trouver la mise en œuvre d'explication et java de ceux-ci ici: https://github.com/tdebatty/java-string-similarity

Vous pouvez y parvenir en utilisant le communes apache bibliothèque java . Jetez un oeil à ces deux fonctions en son sein:
- getLevenshteinDistance
- getFuzzyDistance

En théorie, vous pouvez comparer distances modifier.

Ceci est généralement fait en utilisant un . Recherche de « distance modifier java » tourne un certain nombre de bibliothèques, comme celui-ci .

Sons comme un plagiat finder pour moi si vos tours de cordes dans un document. Peut-être que la recherche avec ce terme se retrouvera quelque chose de bien.

« Programmation Intelligence collective » a un chapitre sur déterminer si deux documents sont similaires. Le code est en Python, mais il est propre et facile au port.

Merci à la première answerer, je pense qu'il y a 2 calculs de computeEditDistance (s1, s2). En raison de dépenses de temps de celui-ci, a décidé d'améliorer la performance du code. Donc:

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }


 }
}

Licencié sous: CC-BY-SA avec attribution

Non affilié à StackOverflow

Similitude Comparaison de chaînes en Java

Calcul de la editDistance():

Exemple de travail:

Calcul de la `editDistance()`: