Question

I use the google-diff-match-patch C# library and want to measure the similarity between two texts. To do this I wrote the following C# code:

List<DiffMatchPatch.Diff> lDiffs = dmpDiff.diff_main(sTexte1, sTexte2);
int iIndex = dmpDiff.diff_levenshtein(lDiffs);
double dsimilarity = 100 - ((double)iIndex / Math.Max(sTexte1.Length, sTexte2.Length) * 100);

This gives similarity values between 0 and 100 (100 == perfect match, 0 == totally different).

Do you think this is a good approach, and is the calculation correct?


Solution

I've had a look at diff_levenshtein on the API home page, and it gives this description:

Given a diff, measure its Levenshtein distance in terms of the number of inserted, deleted or substituted characters. The minimum distance is 0 which means equality, the maximum distance is the length of the longer string.

In the following line, all you are doing is turning the distance (the change measurement) into a percentage of the longer string's length, and then subtracting it from one hundred:

double dsimilarity = 100 - ((double)iIndex / Math.Max(sTexte1.Length, sTexte2.Length) * 100);

So, yes, this looks fine to me.

My only comment would be that the original algorithm uses 0 to represent a perfect match and you are using 100, which might be confusing. If you are OK with this, make sure you comment it appropriately for future maintainers.
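
One way to follow that advice is to wrap the calculation in a small documented helper. This is only a sketch: the names TextSimilarity and SimilarityPercent are made up for illustration, and the guard for two empty texts is an assumption that is not in the original code (without it, the division would be 0 / 0, which gives NaN for a double).

```csharp
using System;

public static class TextSimilarity
{
    /// <summary>
    /// Converts an edit distance into a similarity percentage.
    /// 100 means the texts are identical, 0 means completely different.
    /// Note that this is the reverse of the distance scale, where 0 means equal.
    /// </summary>
    public static double SimilarityPercent(int distance, int length1, int length2)
    {
        int longest = Math.Max(length1, length2);

        // Assumption: two empty texts are treated as identical.
        // Without this guard the division below would be 0 / 0, i.e. NaN.
        if (longest == 0)
        {
            return 100.0;
        }

        return 100.0 - ((double)distance / longest * 100.0);
    }
}
```

With the variables from the question, the call would be TextSimilarity.SimilarityPercent(dmpDiff.diff_levenshtein(lDiffs), sTexte1.Length, sTexte2.Length). For example, a distance of 3 against a longer text of 12 characters gives 100 - (3 / 12 * 100) = 75.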

Licensed under: CC-BY-SA with attribution