Levenshtein distances and special characters

https://stackoverflow.com/questions/10720545

10-06-2021

Question

I have implemented a Levenshtein distance algorithm using a trie, as described here by Steve Hanov. However, I'm having difficulty handling special characters. For instance, if I calculate the distance between Großmann and Grossmann, I need the distance to be zero, since ß and ss should be considered equal.

What would be the best solution (if any) to support these special cases?

My initial thought was to normalize all strings before calculating the distance, so that Großmann -> Grossmann, österreich -> oesterreich, and so on. However, there seems to be no such functionality in .NET?

Solution

The challenge is that the current culture does not identify the language of the individual words.

Assume you are willing to err on the side of a match.

Identify a set of characters that never need to be mapped.

Identify a set of mappings that apply to all cultures.

Identify mappings for specific cultures.

First do an unmapped Levenshtein distance.

If the unmapped distance is zero then stop.

If the unmapped distance is greater than x (e.g. 4) then stop, as it is not a match.

If the words only have characters that never need to be mapped (e.g. a-z) then stop.

Map both words with the all-cultures mappings and if the distance is zero stop.

Map both with the default culture's mappings and if the distance is zero stop.

Map both with the other cultures' mappings and if the distance is zero stop.

I also added a straight string.compare check to the Levenshtein routine so that it reports 0 immediately when the two strings are equal.
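The steps above can be sketched roughly as follows. This is only an illustration in Python, not the answerer's code. The names `ALL_CULTURES`, `CULTURE_MAPS`, `NEVER_MAPPED`, `MAX_DISTANCE`, `apply_map` and `staged_distance` are hypothetical, the mapping tables are tiny examples rather than linguistically complete data, and a plain dynamic-programming distance stands in for the trie-based one. It also assumes that the "only unmapped characters" check must hold for both words.

```python
import re

# Hypothetical, deliberately tiny mapping tables (assumptions for illustration).
ALL_CULTURES = {"ß": "ss"}
CULTURE_MAPS = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue"},
    "sv": {"ä": "a", "ö": "o", "å": "a"},
}
NEVER_MAPPED = re.compile(r"[A-Za-z]*")  # characters that never need mapping
MAX_DISTANCE = 4                         # the "x" threshold from the steps


def levenshtein(a: str, b: str) -> int:
    """Plain row-by-row edit distance (stand-in for the trie variant)."""
    if a == b:  # straight string comparison shortcut
        return 0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def apply_map(word: str, table: dict) -> str:
    """Replace every character that has an entry in the table."""
    return "".join(table.get(ch, ch) for ch in word)


def staged_distance(a: str, b: str, default_culture: str = "de") -> int:
    d = levenshtein(a, b)
    if d == 0 or d > MAX_DISTANCE:
        return d  # exact match, or too far apart to be a match
    if NEVER_MAPPED.fullmatch(a) and NEVER_MAPPED.fullmatch(b):
        return d  # nothing to map in either word
    a, b = apply_map(a, ALL_CULTURES), apply_map(b, ALL_CULTURES)
    if levenshtein(a, b) == 0:
        return 0
    # Default culture first, then every other culture.
    cultures = [default_culture] + [c for c in CULTURE_MAPS if c != default_culture]
    for culture in cultures:
        table = CULTURE_MAPS[culture]
        if levenshtein(apply_map(a, table), apply_map(b, table)) == 0:
            return 0
    return d  # no mapping produced a match; report the unmapped distance
```

With these example tables, `staged_distance("Großmann", "Grossmann")` reaches zero at the all-cultures stage, while `staged_distance("österreich", "osterreich")` only reaches zero through the non-default `sv` table, which is what erring on the side of a match means here.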

Other tips

I think normalization is the way to go.

I'm not aware of any library that does this off-the-shelf, and a quick search didn't turn up anything.
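To illustrate why generic Unicode normalization alone does not give the result the question asks for, here is a small Python sketch (the helper name `strip_marks` is hypothetical). Compatibility decomposition followed by dropping combining marks turns ö into o rather than oe, and it leaves ß untouched because ß has no Unicode decomposition.

```python
import unicodedata


def strip_marks(word: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_marks("österreich"))  # osterreich (not oesterreich)
print(strip_marks("Großmann"))    # Großmann (ß is unchanged)
```

So language-specific spellings such as ö -> oe and ß -> ss still need an explicit mapping.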

A similar issue is discussed here: Converting "Bizarre" Chars in String to Roman Chars.

Their solution, manually creating a mapping, will work as long as you can comprehensively identify all the necessary mappings in advance.
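A minimal sketch of such a manual mapping, written in Python for illustration. The table `GERMAN_FOLD` and the function `normalize` are hypothetical names, and the table is deliberately incomplete: it only covers the German characters mentioned in the question.

```python
# Hypothetical, incomplete mapping table for German text (an assumption).
GERMAN_FOLD = str.maketrans({
    "ß": "ss",
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def normalize(word: str) -> str:
    """Replace each mapped character with its multi-letter spelling."""
    return word.translate(GERMAN_FOLD)


print(normalize("Großmann"))    # Grossmann
print(normalize("österreich"))  # oesterreich
```

One way to use this would be to normalize every word when the trie is built and to normalize each query before the lookup, so that the distance calculation itself stays unchanged.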

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow