Pergunta

On Wikipedia, an implementation for the bottom-up dynamic programming scheme for the edit distance is given. It does not follow the definition completely; inner cells are computed thus:

if s[i] = t[j] then  
  d[i, j] := d[i-1, j-1]       // no operation required
else
  d[i, j] := minimum
             (
               d[i-1, j] + 1,  // a deletion
               d[i, j-1] + 1,  // an insertion
               d[i-1, j-1] + 1 // a substitution
             )
}

As you can see, the algorithm always chooses the value from the upper-left neighbour if there is a match, saving some memory accesses, ALU operations and comparisons.

However, deletion (or insertion) may result in a smaller value, thus the algorithm is locally incorrect, i.e. it breaks with the optimality criterion. But maybe the mistake does not change the end result -- it might be cancelled out.

Is this micro-optimisation valid, and why (not)?

Foi útil?

Solução

I don't think that the algorithm is flawed. If two strings are matched, we compare first its last two characters (and then recurse). If they are the same, we can match them to get an optimal alignment. For example, consider the strings test and testat. If you don't match the two last ts, than one of the ts remains unmatched, since otherwise your matching would look like this:

enter image description here

This is impossible, since the arrows are not allowed to "cross". The matched t induces several inserts (green boxes in the figure), as depicted on the left:

enter image description here

But then you can simply find an equally good alignment, depicted on the right. In both cases you match a t and you have two inserts.

The argument for a substitution of one of the last ts is the same. So if you substitute one of the last ts, then you can instead match the last two t, and get a better alignment (see the picture).

enter image description here

Licenciado em: CC-BY-SA com atribuição
Não afiliado a cs.stackexchange
scroll top