Question

The Wikipedia article on Levenshtein distance says, in its possible modifications section, that "[w]e can store the number of insertions, deletions, and substitutions separately".

How is this done? I created an implementation of the matrix-based dynamic-programming solution outlined in the article, where each cell of the matrix has a separate value for deletions/insertions/substitutions, but for the life of me I can't seem to work out the logic.

Intuitively, it seems that if I find a need to do a deletion and an insertion in the same step, then those should become a substitution instead.

To make what I want completely clear, here's a sample matrix for the source string "mat" and the target string "catch". I expect this to require one substitution (i.e. changing "m" to "c") and two insertions (i.e. appending "ch"). Each cell is "deletions/insertions/substitutions".

           C     A     T     C     H
  +-----+-----+-----+-----+-----+-----+
  |D/I/S|0/1/0|0/2/0|0/3/0|0/4/0|0/5/0|
  +-----+-----+-----+-----+-----+-----+
M |1/0/0|0/0/1|0/1/1|0/2/1|0/3/1|0/4/1|
  +-----+-----+-----+-----+-----+-----+
A |2/0/0|1/0/1|0/0/1|0/1/1|0/2/1|0/3/1|
  +-----+-----+-----+-----+-----+-----+
T |3/0/0|2/0/1|1/0/1|0/0/1|0/1/1|0/2/1|
  +-----+-----+-----+-----+-----+-----+

Has anyone worked out this algorithm?

Was it helpful?

Solution

For a target cell x, we need to find the minimum of:

this + substitution | this + deletion
--------------------+----------------
this + insertion    |       x

From top-left is when we haven't processed to either value yet, so we must process both at the same time, thus it's a substitution.

From left is when we haven't processed the target value yet, so it's insertion.

From top is when we haven't processed the source value yet, so it's deletion.

To store the values separately, you'll need a 3D array:

array[targetSize+1][inputSize+1][3]

Then, for each of the 3 previous cells, you add 1 substitution, deletion or insertion (as indicated above), then calculate the total cost based on the number of substitutions, deletions and insertions, and find the minimum of the 3 costs. Then copy the values from the cell giving the minimum to the current cell (with the 1 operation added).

So, for:

0/1/0|0/2/0
-----+-----
0/0/1|  x

Let's assume a cost of 1 for each operation.

We calculate:
0/1/0 + 1 substitution = 0/1/1, cost = 2
0/0/1 + 1 insertion = 0/1/1, cost = 2
0/2/0 + 1 deletion = 1/2/0, cost = 3

Then we pick either of the cost 2's and put 0/1/1 into the new cell.

I hope that helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top