Rating the quality of string matches

https://stackoverflow.com/questions/4107188

29-09-2019

Question

What would be the best way to compare a pattern with a set of strings, one by one, while rating the amount with which the pattern matches each string? In my limited experience with regex, matching strings with patterns using regex seems to be a pretty binary operation...no matter how complicated the pattern is, in the end, it either matches or it doesn't. I am looking for greater capabilities, beyond just matching. Is there a good technique or algorithm that relates to this?

Here's an example:

Let's say I have a pattern foo bar and I want to find the string that most closely matches it out of the following strings:

foo for
foo bax
foo buo
fxx bar

Now, none of these actually match the pattern, but which non-match is the closest to being a match? In this case, foo bax would be the best choice, since it matches 6 out of the 7 characters.
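The "6 out of the 7 characters" count in the example can be sketched as a position-by-position comparison. This is only an illustration of the counting described above, not a general solution: the class and method names are made up for this sketch, it assumes Java 9+ for List.of, and it treats any position past the end of the shorter string as a mismatch.

```java
import java.util.List;

class PositionalMatch {

    // Counts the positions at which both strings hold the same character.
    // Positions beyond the shorter string's length are never counted as matches.
    static int matchingPositions(String pattern, String candidate) {
        int limit = Math.min(pattern.length(), candidate.length());
        int matches = 0;
        for (int i = 0; i < limit; i++) {
            if (pattern.charAt(i) == candidate.charAt(i)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String pattern = "foo bar";
        for (String candidate : List.of("foo for", "foo bax", "foo buo", "fxx bar")) {
            System.out.println(candidate + ": " + matchingPositions(pattern, candidate)
                    + " of " + pattern.length());
        }
    }
}
```

With the four strings from the example, foo bax scores 6 of 7 and the other three score 5 of 7. A count like this breaks down as soon as a character is inserted or dropped, because everything after that point shifts by one position.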

Apologies if this is a duplicate question; I didn't know exactly what to search for when I checked whether it had already been asked.

Solution

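The code below computes the Levenshtein (edit) distance between two strings: the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. The smaller the distance, the closer the match. This one works; I checked it against the Wikipedia example, where the distance between "kitten" and "sitting" is 3.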

    import java.util.ArrayList;
    import java.util.List;

    public class LevenshteinDistance {

        public static final String TEST_STRING = "foo bar";

        public static void main(String... args) {
            List<String> testList = new ArrayList<String>();
            testList.add("foo for");
            testList.add("foo bax");
            testList.add("foo buo");
            testList.add("fxx bar");
            for (String string : testList) {
                System.out.println("Levenshtein Distance for " + string + " is "
                        + getLevenshteinDistance(TEST_STRING, string));
            }
        }

        public static int getLevenshteinDistance(String s, String t) {
            if (s == null || t == null) {
                throw new IllegalArgumentException("Strings must not be null");
            }

            int n = s.length(); // length of s
            int m = t.length(); // length of t

            // if one string is empty, the distance is the length of the other
            if (n == 0) {
                return m;
            } else if (m == 0) {
                return n;
            }

            int[] p = new int[n + 1]; // 'previous' cost array, horizontally
            int[] d = new int[n + 1]; // cost array, horizontally
            int[] swap;               // placeholder to assist in swapping p and d

            // first row: turning a prefix of s into the empty string costs its length
            for (int i = 0; i <= n; i++) {
                p[i] = i;
            }

            for (int j = 1; j <= m; j++) {
                char tj = t.charAt(j - 1); // jth character of t
                d[0] = j;

                for (int i = 1; i <= n; i++) {
                    int cost = s.charAt(i - 1) == tj ? 0 : 1;
                    // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                    d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
                }

                // swap the rows so the current costs become the 'previous row' costs
                swap = p;
                p = d;
                d = swap;
            }

            // our last action in the above loop was to switch d and p, so p now
            // actually has the most recent cost counts
            return p[n];
        }
    }
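To go from a list of distances to an actual rating, one option is to pick the candidate with the smallest distance and, if a score between 0 and 1 is more convenient, divide the distance by the length of the longer string. The sketch below is self-contained, so it repeats the same two-row recurrence. The normalization and the names ClosestMatch, similarity and closest are assumptions made for this illustration, not part of the answer above, and List.of assumes Java 9+.

```java
import java.util.List;

class ClosestMatch {

    // Two-row Levenshtein distance, using the same recurrence as the answer above.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Assumed scoring: 1.0 for identical strings, 0.0 when the distance
    // equals the length of the longer string.
    static double similarity(String s, String t) {
        int longest = Math.max(s.length(), t.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(s, t) / longest;
    }

    // Returns the candidate with the smallest distance; ties go to the first one seen.
    static String closest(String pattern, List<String> candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int d = distance(pattern, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("foo for", "foo bax", "foo buo", "fxx bar");
        System.out.println("Closest to \"foo bar\": " + closest("foo bar", candidates));
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + similarity("foo bar", candidate));
        }
    }
}
```

For the strings in the question, foo bax is at distance 1 and the other three are at distance 2, so foo bax comes out as the closest match, which agrees with the expectation stated in the question.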

OTHER TIPS

That's an interesting question! The first thing that came to mind is that regular expressions can be matched by building a DFA (not every engine works this way; many use backtracking instead). If you had direct access to the DFA that was built for a given regex (or just built it yourself!), you could run the input and then measure the distance from the last state you transitioned to and an accept state, using the shortest path as a measure of how close the input was to being accepted. I'm not aware of any libraries that would let you do that easily, though, and even this measure probably wouldn't map exactly onto your intuition in a number of cases.
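A rough sketch of that idea, under heavy assumptions: since no regex library is being used here, the DFA is built by hand as a simple chain that accepts only the literal foo bar, and every name in it (DfaDistance, forLiteral, lastState, distanceToAccept, score) is invented for this illustration. The input is run until it ends or no transition exists, and a breadth-first search then counts the fewest transitions from that state to an accept state. One simplification to be aware of: input that continues after an accept state has been reached is not penalized. Map.of and List.of assume Java 9+.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DfaDistance {

    // transitions.get(state).get(symbol) gives the next state; state 0 is the start.
    final Map<Integer, Map<Character, Integer>> transitions = new HashMap<>();
    final Set<Integer> accepting = new HashSet<>();

    void addTransition(int from, char symbol, int to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(symbol, to);
    }

    // Builds a chain-shaped DFA that accepts exactly the given literal.
    static DfaDistance forLiteral(String literal) {
        DfaDistance dfa = new DfaDistance();
        for (int i = 0; i < literal.length(); i++) {
            dfa.addTransition(i, literal.charAt(i), i + 1);
        }
        dfa.accepting.add(literal.length());
        return dfa;
    }

    // Follows transitions until the input ends or no transition exists.
    int lastState(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            Integer next = transitions.getOrDefault(state, Map.of()).get(c);
            if (next == null) {
                break;
            }
            state = next;
        }
        return state;
    }

    // Breadth-first search: fewest transitions from start to any accept state, or -1.
    int distanceToAccept(int start) {
        Map<Integer, Integer> dist = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        dist.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            int state = queue.poll();
            if (accepting.contains(state)) {
                return dist.get(state);
            }
            for (int next : transitions.getOrDefault(state, Map.of()).values()) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(state) + 1);
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    int score(String input) {
        return distanceToAccept(lastState(input));
    }

    public static void main(String[] args) {
        DfaDistance dfa = DfaDistance.forLiteral("foo bar");
        for (String s : List.of("foo for", "foo bax", "foo buo", "fxx bar")) {
            System.out.println(s + " -> " + dfa.score(s));
        }
    }
}
```

On the strings from the question this gives foo bax 1, foo buo 2, foo for 3 and fxx bar 6. The last result shows the mismatch with intuition mentioned above: fxx bar differs from the pattern in only two characters, but because the mismatch comes early, the run stops near the start state and the shortest path to acceptance is long.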

Licensed under: CC-BY-SA with attribution
