我想要的比较数字符串彼此,并找到那些最类似。我想知道如果有任何库、方法或最佳做法,将返回我这弦更类似于其他串。例如:

  • "快速跳狐狸"->"狐狸上跳下"
  • "快速跳狐狸"->"的狐狸"

这一比较将会返回,首先是更加类似于第二位。

我猜我需要一些方法,如:

double similarityIndex(String s1, String s2)

是否有这样的事情的地方?

编辑:为什么我要这么做?我写的脚本比较输出的一个MS项目的文件给输出的一些遗留系统处理的任务。因为遗留系统具有非常有限场的宽度,当值增加了描述的缩写。我想要一些半自动的方式来找到其中的条目从MS项目类似的项目在系统上这样我就可以得到所产生的钥匙。它有缺点,因为它必须仍手动检查,但这将节省大量的工作

有帮助吗?

解决方案

是的,有许多很好的记载的算法,如:

  • 余弦的相似性
  • Jaccard相似性
  • 骰子的系数
  • 匹配的相似性
  • 重叠的相似性
  • 等等等等

或者 你可以检查这个

检查也为这些项目:

其他提示

共同的方式 计算相似之间的两个串在一个0%-100%时尚, 在许多图书馆,是衡量多少(%)你就会改变长串把它变成较短:

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below


计算 editDistance():

editDistance() 功能上是预期的计算 编辑距离 两者之间的串。还有 几个实现的 到这一步,每个可能适合特定情况更好。最常见的是的 Levenshtein distance算法 我们会用它在我们下面的例子(很大串,其他算法可能更好地执行).

这里有两个选项计算的编辑距离:


工作的例子:

在网上看到演示的在这里。

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

输出:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"

我翻译 Levenshtein distance算法 为JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};

你可以使用的编辑距离计算之间的差异,两串。http://en.wikipedia.org/wiki/Levenshtein_distance

确实有许多串的相似性措施:

  • Levenshtein编辑的距离;
  • Damerau-Levenshtein的距离;
  • 哈罗-Winkler相似性;
  • 最常见的子序列编辑的距离;
  • Q-Gram(Ukkonen);
  • n-Gram距离(Kondrak);
  • Jaccard索引;
  • Sorensen骰子系数;
  • 余弦的相似性;
  • ...

你可以找到的解释和执行这些在这里:https://github.com/tdebatty/java-string-similarity

你可以实现这个用的 apache共同java的图书馆.看看这两个职能:
- getLevenshteinDistance
- getFuzzyDistance

理论上,可以比较 编辑的距离.

这通常是使用 编辑距离 措施。寻找"的编辑距离java"变成了一个数字图书馆,就像 这一个.

听起来像一个 剽窃搜索 我说如果你的字符串变成一个文件。也许在搜寻与该术语会变成了好东西。

"编程的集体智慧"有一章是关于确定两个文件都是相似的。代码是蟒蛇,但它的清洁和易于港口。

谢谢到的第一个回答者中,我认为有2的计算computeEditDistance(s1,s2)。由于高时间支出的,决定提高码的业绩。所以:

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }


 }
}
许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top