Optimizing Jaro-Winkler algorithm

https://stackoverflow.com/questions/2848807

27-09-2019

Question

I have this code for the Jaro-Winkler algorithm, taken from this website. I need to run it 150,000 times to compute distances between strings. It takes a long time because I run it on an Android mobile device.

Can it be optimized more?

public class Jaro {
    /**
     * gets the similarity of the two strings using Jaro distance.
     *
     * @param string1 the first input string
     * @param string2 the second input string
     * @return a value between 0-1 of the similarity
     */
    public float getSimilarity(final String string1, final String string2) {

        //get half the length of the string rounded up - (this is the distance used for acceptable transpositions)
        final int halflen = ((Math.min(string1.length(), string2.length())) / 2) + ((Math.min(string1.length(), string2.length())) % 2);

        //get common characters
        final StringBuffer common1 = getCommonCharacters(string1, string2, halflen);
        final StringBuffer common2 = getCommonCharacters(string2, string1, halflen);

        //check for zero in common
        if (common1.length() == 0 || common2.length() == 0) {
            return 0.0f;
        }

        //check that both common strings have the same length; return 0.0f if they do not
        if (common1.length() != common2.length()) {
            return 0.0f;
        }

        //get the number of transpositions
        int transpositions = 0;
        int n=common1.length();
        for (int i = 0; i < n; i++) {
            if (common1.charAt(i) != common2.charAt(i))
                transpositions++;
        }
        transpositions /= 2.0f;

        //calculate jaro metric
        return (common1.length() / ((float) string1.length()) +
                common2.length() / ((float) string2.length()) +
                (common1.length() - transpositions) / ((float) common1.length())) / 3.0f;
    }

    /**
     * returns a string buffer of characters from string1 within string2 if they are of a given
     * distance separation from the position in string1.
     *
     * @param string1
     * @param string2
     * @param distanceSep
     * @return a string buffer of characters from string1 within string2 if they are of a given
     *         distance separation from the position in string1
     */
    private static StringBuffer getCommonCharacters(final String string1, final String string2, final int distanceSep) {
        //create a return buffer of characters
        final StringBuffer returnCommons = new StringBuffer();
        //create a copy of string2 for processing
        final StringBuffer copy = new StringBuffer(string2);
        //iterate over string1
        int n=string1.length();
        int m=string2.length();
        for (int i = 0; i < n; i++) {
            final char ch = string1.charAt(i);
            //set boolean for quick loop exit if found
            boolean foundIt = false;
            //compare char with range of characters to either side

            for (int j = Math.max(0, i - distanceSep); !foundIt && j < Math.min(i + distanceSep, m - 1); j++) {
                //check if found
                if (copy.charAt(j) == ch) {
                    foundIt = true;
                    //append character found
                    returnCommons.append(ch);
                    //alter copied string2 for processing
                    copy.setCharAt(j, (char)0);
                }
            }
        }
        return returnCommons;
    }
}

Note that I create just one instance of the class for the whole process, so this runs only once:

jaro= new Jaro();

If you want to test it and need examples that do not break the code, you will find them here, in another thread about optimizing the Python version.

Solution

Yes, but you aren't going to enjoy it. Replace all those newed StringBuffers with char arrays that are allocated in the constructor and never again, using integer indices to keep track of what's in them.

This pending Commons-Lang patch will give you some of the flavor.
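To give a rough idea of what "allocate once, reuse with indices" could look like, here is a minimal sketch. It is not the Commons-Lang patch: the class name ReusableJaro and its fields are made up for illustration, it uses boolean flag arrays in place of character buffers, and it follows the standard Jaro formulation (one matching pass, window of max length / 2 - 1), so its scores can differ slightly from the code in the question.

```java
import java.util.Arrays;

/** Hypothetical sketch: Jaro similarity with buffers allocated once and reused. */
final class ReusableJaro {
    // Flags marking which positions of each string are already matched.
    private boolean[] matched1;
    private boolean[] matched2;

    ReusableJaro(int maxLength) {
        matched1 = new boolean[maxLength];
        matched2 = new boolean[maxLength];
    }

    float getSimilarity(String s1, String s2) {
        final int n = s1.length();
        final int m = s2.length();
        if (n == 0 || m == 0) return 0.0f;
        // Grow only when an input is longer than anything seen so far.
        if (n > matched1.length) matched1 = new boolean[n];
        if (m > matched2.length) matched2 = new boolean[m];
        Arrays.fill(matched1, 0, n, false);
        Arrays.fill(matched2, 0, m, false);

        // Standard Jaro window (an assumption; the question uses a different one).
        final int range = Math.max(0, Math.max(n, m) / 2 - 1);
        int matches = 0;
        for (int i = 0; i < n; i++) {
            final char ch = s1.charAt(i);
            final int end = Math.min(m, i + range + 1);
            for (int j = Math.max(0, i - range); j < end; j++) {
                if (!matched2[j] && s2.charAt(j) == ch) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0f;

        // Walk the matched characters of both strings in order and count mismatches.
        int k = 0;
        int halfTranspositions = 0;
        for (int i = 0; i < n; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) halfTranspositions++;
            k++;
        }
        final float mf = matches;
        return (mf / n + mf / m + (mf - halfTranspositions / 2) / mf) / 3.0f;
    }
}
```

A single instance can then serve all 150,000 comparisons without allocating per call, as long as it is not shared between threads.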

OTHER TIPS

I know this question has probably been solved for some time, but I would like to comment on the algorithm itself. When comparing a string against itself, the answer turns out to be 1/|string| off. When comparing slightly different values, the values also turn out to be lower.

The solution to this is to adjust 'm-1' to 'm' in the inner for-statement within the getCommonCharacters method. The code then works like a charm :)

See http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance as well for some examples.
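As an illustration of the fix described above, here is the question's getCommonCharacters with only the loop bound changed (and StringBuilder swapped in for StringBuffer, which is my own substitution). The wrapper class name CommonChars is hypothetical. With the original 'm-1' bound, comparing "abc" with itself using a window of 2 never reaches the final 'c', so only "ab" is returned and the score comes out as (2/3 + 2/3 + 1) / 3, about 0.78, instead of 1.

```java
/** Hypothetical wrapper around the corrected method from the question. */
final class CommonChars {
    static StringBuilder getCommonCharacters(String string1, String string2, int distanceSep) {
        final StringBuilder returnCommons = new StringBuilder();
        // Working copy of string2 so matched characters can be blanked out.
        final StringBuilder copy = new StringBuilder(string2);
        final int n = string1.length();
        final int m = string2.length();
        for (int i = 0; i < n; i++) {
            final char ch = string1.charAt(i);
            boolean foundIt = false;
            // The upper bound is exclusive, so it must be m (not m - 1)
            // for the last character of string2 to be considered.
            for (int j = Math.max(0, i - distanceSep);
                 !foundIt && j < Math.min(i + distanceSep, m); j++) {
                if (copy.charAt(j) == ch) {
                    foundIt = true;
                    returnCommons.append(ch);
                    // Mark this position as used.
                    copy.setCharAt(j, (char) 0);
                }
            }
        }
        return returnCommons;
    }
}
```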

Try to avoid the two nested loops in the getCommonCharacters method.
A suggestion as to how: store all the chars of the smaller string in a map of some sort (Java has a few), where the key is the character and the value is its position. That way you can still calculate the distance and check whether the characters are in common. I don't quite understand the algorithm, but I think this is doable.
Apart from that and bmargulies's answer, I really don't see further optimizations beyond bit-level tricks. If this is really critical, consider rewriting this portion in C.
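One possible reading of the map idea, sketched under my own assumptions: index the second string by character, chaining repeated characters through a "next position" array, so that each character of the first string jumps straight to its candidate positions instead of scanning the whole window. The names IndexedMatcher and countMatches are hypothetical, the window is inclusive on both sides as in the standard Jaro definition, and only the match count is computed. For short strings the boxing in HashMap may well cost more than the nested loops, so this would need measuring on the device.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: Jaro match counting with a character-to-position index. */
final class IndexedMatcher {
    static int countMatches(String s1, String s2, int range) {
        final int m = s2.length();
        // head: first not-yet-consumed position of each character in s2 (-1 if none).
        final Map<Character, Integer> head = new HashMap<>();
        // next[j]: next position after j holding the same character, or -1.
        final int[] next = new int[m];
        for (int j = m - 1; j >= 0; j--) {
            final Integer previous = head.put(s2.charAt(j), j);
            next[j] = (previous == null) ? -1 : previous;
        }

        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            final char c = s1.charAt(i);
            final Integer start = head.get(c);
            if (start == null) continue; // character never occurs in s2
            int p = start;
            // Skip positions that fell out of the window on the left;
            // they stay out of reach for every later i as well.
            while (p != -1 && p < i - range) p = next[p];
            if (p != -1 && p <= i + range) {
                matches++;
                p = next[p]; // consume the matched position
            }
            head.put(c, p);
        }
        return matches;
    }
}
```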

I don't know much about Android and how it works with databases. WP7 has (will have :) ) SQL CE. The next step would typically be to work with your data: store the string lengths and use them to limit your comparisons. Add indexes on both columns and sort by length and then by value. The index on length should be sorted as well. I had this running on an old server with 150,000 medical terms, giving me suggestions and spell checking in under 0.5 seconds; users could barely notice it, especially when it ran on a separate thread.

I meant to blog about it for a long time (like 2 years :) ) because there is a need. I finally managed to write a few words about it and provide some tips. Please check it out here:

ISolvable.blogspot.com

Although it targets the Microsoft platform, the general principles are the same.
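The length-based pruning mentioned above can also be done in plain code, without a database. Since the number of matching characters can never exceed the shorter length, the plain Jaro score of two strings is at most (2 + min/max) / 3, where min and max are the two lengths. The sketch below uses that bound to discard candidates before any character comparison; the names LengthFilter, maxPossibleJaro and candidates are hypothetical, and the bound does not account for the Winkler prefix bonus.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: prune candidates by length before computing Jaro. */
final class LengthFilter {
    /** Upper bound on the plain Jaro score of any two strings with these lengths. */
    static float maxPossibleJaro(int len1, int len2) {
        if (len1 == 0 || len2 == 0) return 0.0f;
        final float ratio = Math.min(len1, len2) / (float) Math.max(len1, len2);
        return (2.0f + ratio) / 3.0f;
    }

    /** Keeps only the terms whose length still allows a score >= threshold. */
    static List<String> candidates(String query, List<String> terms, float threshold) {
        final List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (maxPossibleJaro(query.length(), term.length()) >= threshold) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

If the terms are kept sorted by length, the same bound turns into a contiguous range of lengths to scan, which is presumably what the sorted index on length buys in the database version.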

Yes, this can be made a lot faster. For one thing, you don't need the StringBuffers at all. For another, you don't need a separate loop to count transpositions.

You can find my implementation here, and it should be a lot faster. It's under Apache 2.0 License.

Instead of returning the common characters from the getCommonCharacters method, use a couple of arrays to keep track of the matches, similar to the C version here: https://github.com/miguelvps/c/blob/master/jarowinkler.c

/*Calculate matching characters*/
for (i = 0; i < al; i++) {
    for (j = max(i - range, 0), l = min(i + range + 1, sl); j < l; j++) {
        if (a[i] == s[j] && !sflags[j]) {
            sflags[j] = 1;
            aflags[i] = 1;
            m++;
            break;
        }
    }
}

Another optimization is to pre-calculate a bitmask for each string. Using that, check if the current character on the first string is present on the second. This can be done using efficient bitwise operations.

This will skip calculating the max/min and looping for missing characters.
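A minimal sketch of such a bitmask, assuming a 64-bit signature in which each character sets the bit given by its low six bits. The names CharMask, maskOf, mayContain and mayShareCharacters are hypothetical. Different characters can land on the same bit, so a set bit only means "possibly present", while a clear bit reliably means "absent".

```java
/** Hypothetical sketch: 64-bit character signature for quick rejection. */
final class CharMask {
    /** Sets bit (c mod 64) for every character c in s; compute once per string. */
    static long maskOf(String s) {
        long mask = 0L;
        for (int i = 0; i < s.length(); i++) {
            mask |= 1L << (s.charAt(i) & 63);
        }
        return mask;
    }

    /** False means c is definitely absent; true means it may be present. */
    static boolean mayContain(long mask, char c) {
        return (mask & (1L << (c & 63))) != 0L;
    }

    /** False means the two strings have no character in common at all. */
    static boolean mayShareCharacters(long mask1, long mask2) {
        return (mask1 & mask2) != 0L;
    }
}
```

In the matching loop, a failed mayContain check for the current character lets you skip the window computation and the inner loop entirely, and a failed mayShareCharacters check lets you return 0 for the pair immediately.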

Licensed under: CC-BY-SA with attribution
