Micro-optimisation for edit distance computation: is it valid?
Pergunta
On Wikipedia, an implementation for the bottom-up dynamic programming scheme for the edit distance is given. It does not follow the definition completely; inner cells are computed thus:
if s[i] = t[j] then
d[i, j] := d[i-1, j-1] // no operation required
else
d[i, j] := minimum
(
d[i-1, j] + 1, // a deletion
d[i, j-1] + 1, // an insertion
d[i-1, j-1] + 1 // a substitution
)
}
As you can see, the algorithm always chooses the value from the upper-left neighbour if there is a match, saving some memory accesses, ALU operations and comparisons.
However, deletion (or insertion) may result in a smaller value, thus the algorithm is locally incorrect, i.e. it breaks with the optimality criterion. But maybe the mistake does not change the end result -- it might be cancelled out.
Is this micro-optimisation valid, and why (not)?
Solução
I don't think that the algorithm is flawed. If two strings are matched, we compare first its last two characters (and then recurse). If they are the same, we can match them to get an optimal alignment. For example, consider the strings test
and testat
.
If you don't match the two last t
s, than one of the t
s remains unmatched, since otherwise your matching would look like this:
This is impossible, since the arrows are not allowed to "cross". The matched t
induces several inserts (green boxes in the figure), as depicted on the left:
But then you can simply find an equally good alignment, depicted on the right. In both cases you match a t
and you have two inserts.
The argument for a substitution of one of the last t
s is the same. So if you substitute one of the last t
s, then you can instead match the last two t, and get a better alignment (see the picture).