优化 Jaro-Winkler 算法

https://stackoverflow.com/questions/2848807

27-09-2019
|

题

我有 Jaro-Winkler 算法的代码，取自这网站。我需要跑 150,000 次才能获得差异之间的距离。由于我在 Android 移动设备上运行，因此需要很长时间。

可以再优化一下吗？

public class Jaro {
    /**
     * gets the similarity of the two strings using Jaro distance.
     *
     * @param string1 the first input string
     * @param string2 the second input string
     * @return a value between 0-1 of the similarity
     */
    public float getSimilarity(final String string1, final String string2) {

        //get half the length of the string rounded up - (this is the distance used for acceptable transpositions)
        final int halflen = ((Math.min(string1.length(), string2.length())) / 2) + ((Math.min(string1.length(), string2.length())) % 2);

        //get common characters
        final StringBuffer common1 = getCommonCharacters(string1, string2, halflen);
        final StringBuffer common2 = getCommonCharacters(string2, string1, halflen);

        //check for zero in common
        if (common1.length() == 0 || common2.length() == 0) {
            return 0.0f;
        }

        //check for same length common strings returning 0.0f is not the same
        if (common1.length() != common2.length()) {
            return 0.0f;
        }

        //get the number of transpositions
        int transpositions = 0;
        int n=common1.length();
        for (int i = 0; i < n; i++) {
            if (common1.charAt(i) != common2.charAt(i))
                transpositions++;
        }
        transpositions /= 2.0f;

        //calculate jaro metric
        return (common1.length() / ((float) string1.length()) +
                common2.length() / ((float) string2.length()) +
                (common1.length() - transpositions) / ((float) common1.length())) / 3.0f;
    }

    /**
     * returns a string buffer of characters from string1 within string2 if they are of a given
     * distance seperation from the position in string1.
     *
     * @param string1
     * @param string2
     * @param distanceSep
     * @return a string buffer of characters from string1 within string2 if they are of a given
     *         distance seperation from the position in string1
     */
    private static StringBuffer getCommonCharacters(final String string1, final String string2, final int distanceSep) {
        //create a return buffer of characters
        final StringBuffer returnCommons = new StringBuffer();
        //create a copy of string2 for processing
        final StringBuffer copy = new StringBuffer(string2);
        //iterate over string1
        int n=string1.length();
        int m=string2.length();
        for (int i = 0; i < n; i++) {
            final char ch = string1.charAt(i);
            //set boolean for quick loop exit if found
            boolean foundIt = false;
            //compare char with range of characters to either side

            for (int j = Math.max(0, i - distanceSep); !foundIt && j < Math.min(i + distanceSep, m - 1); j++) {
                //check if found
                if (copy.charAt(j) == ch) {
                    foundIt = true;
                    //append character found
                    returnCommons.append(ch);
                    //alter copied string2 for processing
                    copy.setCharAt(j, (char)0);
                }
            }
        }
        return returnCommons;
    }
}

我提到在整个过程中我只制作脚本的实例，所以只制作一次

jaro= new Jaro();

如果您要测试并需要示例，以免破坏脚本，您会找到它这里, ，在另一个线程中进行python优化

解决方案

是的，但你是不是要去享受它。替换那些在构造函数分配并永远不会再使用整数索引来跟踪什么在他们字符数组所有这些newed StringBuffers，

这未决下议院郎补丁会给你一些味道

其他提示

我知道这个问题可能已经解决了一段时间，但我想对算法本身进行评论。当比较对自身的字符串，答案原来是1 / |串|关闭。当比较略有不同的值，该值也变成是较低的。

此的解决方案是调节“M-1”至“M”在getCommonCharacters方法中的内for语句。然后，该代码就像一个魅力：）

请参阅 http://en.wikipedia.org/wiki/Jaro%E2 ％80％93Winkler_distance 与一些例子。

尽量避免 getCommonCharacters 循环中的两个嵌套循环。
关于如何进行的建议：将较小字符串中的所有字符存储在某种类型的映射中（java有一些），其中键是字符，值是位置，这样您仍然可以计算距离，无论它们是否相同。我不太明白这个算法，但我认为这是可行的。
除了这个和 bmargulies 的答案之外，我真的没有看到除了位等之外的进一步优化。如果这确实很重要，请考虑用 C 重写这部分？

我不知道很多关于Android和它如何与数据库工作。 WP7已经（将有:)）SQL CE。下一步通常是与您的数据的工作。添加字符串长度和限制你的比较。添加在列和排序长度，然后由价值指数。对长度的指数应被分类为好。我是有一个旧的服务器150个000医疗方面给我的建议上运行，并拼在0.5秒内检查，用户可以几乎没有注意到它，尤其是如果在一个单独的线程运行。

我的意思是在博客上写下了很长一段时间（如2年:)），因为有需要。但我终于设法写出来几个字，并提供一些提示。请看看这里：

ISolvable.blogspot.com

虽然它是微软的平台，还是一般的原则是相同的。

是的，这可以做得更快。一方面，您根本不需要 StringBuffer。另一方面，您不需要单独的循环来计算转置次数。

你可以找到我在这里的实现, ，而且应该会快很多。它遵循 Apache 2.0 许可证。

使用 GetCommonCharacters 方法返回常见字符，而是使用几个数组来保持匹配，类似于此处的 C 版本 https://github.com/miguelvps/c/blob/master/jarowinkler.c

/*Calculate matching characters*/
for (i = 0; i < al; i++) {
    for (j = max(i - range, 0), l = min(i + range + 1, sl); j < l; j++) {
        if (a[i] == s[j] && !sflags[j]) {
            sflags[j] = 1;
            aflags[i] = 1;
            m++;
            break;
        }
    }
}

另一个优化是为每个字符串预先计算一个位掩码。使用它，检查第一个字符串中的当前字符是否存在于第二个字符串中。这可以使用有效的按位运算来完成。

这将跳过计算最大/最小并循环查找丢失的字符。

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow