Question

I need to implement a string-matching algorithm to determine which strings most closely match. I gather that the Hamming distance is a good matching algorithm when the strings have a fixed (equal) length.

Is there any advantage in the quality of matching if I were to use the Levenshtein distance formula instead? I know this method is less efficient, given that it accounts for variable-length strings, but what I'm really concerned with here is the quality of the matches. Also, are there any better algorithms out there I may want to consider? I work in Java if that makes any difference.

http://en.wikipedia.org/wiki/Levenshtein_distance

http://en.wikipedia.org/wiki/Hamming_distance

Much thanks


Solution

Consider the strings: "abcdefg" and "bcdefgh".

The Levenshtein distance is 2. The Hamming distance (operating on characters rather than bits) is 7.

So it really depends on whether you want to treat those strings as being similar or not. Hamming distance has its appropriate uses, but "will these strings look similar to a human being?" is not one of them.
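Here is a minimal sketch in plain Java (class and method names are just for the example) that reproduces those two numbers: Hamming counts positionwise mismatches over equal-length strings, while Levenshtein counts the edits (insertions, deletions, substitutions) needed to turn one string into the other.

    public class DistanceDemo {

        // Hamming distance over characters; only defined for equal-length strings.
        static int hamming(String a, String b) {
            if (a.length() != b.length()) {
                throw new IllegalArgumentException("Hamming distance needs equal lengths");
            }
            int d = 0;
            for (int i = 0; i < a.length(); i++) {
                if (a.charAt(i) != b.charAt(i)) d++;
            }
            return d;
        }

        // Classic dynamic-programming Levenshtein distance, two rolling rows.
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                                prev[j] + 1),      // deletion
                                       prev[j - 1] + cost);        // substitution / match
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        public static void main(String[] args) {
            System.out.println(hamming("abcdefg", "bcdefgh"));      // 7
            System.out.println(levenshtein("abcdefg", "bcdefgh"));  // 2
        }
    }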

OTHER TIPS

You may find the Bitap algorithm interesting.

The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates-Gonnet algorithm) is a fuzzy string searching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance — if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast.
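As an illustration, here is a small Java sketch of the shift-and bitap core, limited to substitution errors only (the full Levenshtein variant also handles insertions and deletions). The method name and the ASCII / 64-bit-pattern limits are assumptions made for the example, not part of any library API.

    // Returns the start index of the first substring of `text` that matches
    // `pattern` with at most k substitutions, or -1 if there is none.
    // Assumes ASCII input and a pattern of at most 63 characters so the
    // state fits in a long.
    static int bitapFuzzySearch(String text, String pattern, int k) {
        int m = pattern.length();
        if (m == 0) return 0;
        if (m > 63) throw new IllegalArgumentException("pattern too long for a 64-bit mask");

        // Precompute one bitmask per character: bit i of masks[c] is set
        // when pattern.charAt(i) == c.
        long[] masks = new long[128];
        for (int i = 0; i < m; i++) {
            masks[pattern.charAt(i)] |= 1L << i;
        }

        // r[d]: bit i set means pattern[0..i] matches the text ending at the
        // current position with at most d substitutions.
        long[] r = new long[k + 1];
        for (int j = 0; j < text.length(); j++) {
            char ch = text.charAt(j);
            long c = ch < 128 ? masks[ch] : 0L;
            long prev = r[0];                        // previous level's old state
            r[0] = ((r[0] << 1) | 1L) & c;
            for (int d = 1; d <= k; d++) {
                long old = r[d];
                // Either the character matches at this error level, or we spend
                // one substitution on it (extending the lower level's old state).
                r[d] = (((r[d] << 1) | 1L) & c) | ((prev << 1) | 1L);
                prev = old;
            }
            if ((r[k] & (1L << (m - 1))) != 0) {
                return j - m + 1;                    // start index of the match
            }
        }
        return -1;
    }

For example, bitapFuzzySearch("the quick brown fox", "quikc", 2) returns 4, because "quick" is within two substitutions of the pattern. The appeal of the approach is exactly what the quoted description says: after the bitmask precomputation, each text character is processed with a handful of bitwise operations per error level.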

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow