Question on Levenshtein distance

https://stackoverflow.com/questions/857031

21-08-2019
|

Question

1) Why do we add 1 on these line?

    d[i-1, j] + 1, // deletion 
    d[i, j-1] + 1, // insertion

The line

if s[i] = t[j] then cost := 0

        else cost := 1

should take into account deleted/lower word lengths, or am I missing something?

2) Also, the comments state deletion and insertion. Am I right in thinking that it's checking for deleted characters in both words (the integers j/i representing the length of words), because a lower value will represent deleted characters.

The code used is here (because it is pseudo code and I have no language specific issues, this thread is not in any language category):

http://www.iterasi.net/openviewer.aspx?sqrlitid=z0cloj7xhk-ce0f72v4cjq

Solution

1) These lines compute the distance in the case of deletion, in the case of insertion, and the one using "cost" in case of a substitution...

deletion and insertion effectively count as "1" in the distance calculation, hence the +1.

We can believe there was a substitution only if the characters are different hence the "cost=0" if both chars are equal...

The new distance is then the minimum distance between these 3 hypothesis so you don't always add 1 ...

2) if I compute the distance between "FooBar" and "FoBaWhatever" I have some character deletions even if the second string is longer than the first one ...

Of course if the second string is shorter than the second ( FooBar -> FoBa ) I will find some deletions but cannot know in advance where they are...

OTHER TIPS

Have you read http://www.merriampark.com/ld.htm ?

You're computing the cost of transformation -- the number of inserts and deletes -- required to make one string into another.

This "cost" to transform is indicative of the distance between the two strings.

What about exchanges? That's the Damerau–Levenshtein algorithm, which is different. Including exchanges doesn't improve things much.

The essence is to create a matrix between the two words and compute -- column by column -- the "distance" from each letter of each word to each letter of the other word. The lower right hand corner of that matrix is the total distance, taking into account all of the letters.

Question 1)

The cell "above" reflects a history of changes, and the character for that row is (usually) different from this, so this cell is a deletion relative to it.

The cell "left" reflects a history of changes, and the character for that column is is (usually) different from this, so this cell is an insertion relative to it.

The only time this usually would be way wrong is words with a triple-letter sequence. Rare in English.

The row-column comparison has a cost of 0 or 1.

The minimum of "history plus one change" and the actual cost of a change is the applicable cost.

Question 2)

The variables i and j aren't lengths of anything. They're positions in the comparison matrix. The "insertion" and "deletion" is the action required to transform one word into the other. The count of insert/delete actions is the distance between the words.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow