Damerau–Levenshtein distance for language-specific quirks

https://stackoverflow.com/questions/4593930

15-10-2019

Question

To Dutch-speaking people, the two characters "ij" are considered a single letter that is easily exchanged with "y".

For a project I'm working on I would like to have a variant of the Damerau–Levenshtein distance that calculates the distance between "ij" and "y" as 1 instead of the current value of 2.

I've been trying this myself but failed. My problem is that I do not have a clue on how to handle the fact that both texts are of different lengths. Does anyone have a suggestion/code fragment on how to solve this?

Thanks.

Solution

The Wikipedia article is rather loose with terminology. There are no such things as "strings" in "natural language". There are phonemes in natural language which can be represented by written characters and character-combinations.

Some character-combinations are vestiges of historical conventions which have survived into modern times, as in modern English "rough" where the "gh" can sound like -f- or make no sound at all. It seems to me that in focusing on raw "strings" the algorithm must be agnostic about the historical relationship of language and orthographic convention, which leads to some arbitrary metrics whenever character-combinations correlate to a single phoneme. How would it measure "rough" to "ruf"? Or "through" to "thru"? Or German o-umlaut to "oe"?

In your case the -y- can be exchanged phonetically and orthographically with -ij-. So what is that according to the algorithm, two deletions followed by an insertion, or a single deletion of the -j- or of the -i- followed by a transposition of the remaining character to -y-? Or is -ij- being coalesced and the coalescence is followed by a transposition?

I would recommend that you substitute a single, otherwise unused character for -ij- before applying the algorithm, perhaps U+00EC, Latin small letter i with grave accent.
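A minimal sketch of that substitution, assuming the placeholder U+00EC never occurs in the input and that the restricted (optimal string alignment) variant of Damerau–Levenshtein is acceptable. The names `PLACEHOLDER`, `collapse_ij` and `osa_distance` are made up for this illustration and do not come from any library.

```python
PLACEHOLDER = "\u00ec"  # assumption: this character never appears in the real text


def collapse_ij(text):
    """Replace every "ij" with a single placeholder character."""
    return text.replace("ij", PLACEHOLDER)


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

On the raw strings, `osa_distance("ij", "y")` is 2. After collapsing, `osa_distance(collapse_ij("ij"), collapse_ij("y"))` is 1, because the placeholder and "y" differ by a single substitution.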

How does the algorithm handle multi-codepoint characters?

Other tips

Well, the D-L distance itself isn't going to handle it for you, due to the way it measures distances.

As there is no code (or language) involved here, I can only leave you with a suggestion to ensure all strings adhere to the same structure.

To clarify, since you're asking in general terms: bear in mind that the D-L distance compares strings character by character and does not interpret them. You will therefore have to parse the strings before comparing them, because cases where "ij" should not be exchanged with "y" will otherwise cause other problems.

An idea is to translate each string into some sort of constructed orthographic representation in which digraphs such as "ij" and the English "gh", "th" and friends are only one symbol long. The penalty does not have to be the same for every type of replacement in Damerau–Levenshtein, so you can use whatever penalties you want, but the table needs to be filled locally; therefore you really want each sound to be one cell in the table.
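A rough sketch of that idea, assuming a greedy left-to-right tokenizer and the restricted (optimal string alignment) variant of the distance. The digraph list, the 0.5 penalty for "ij" versus "y", and all function names are hypothetical choices for illustration; setting the penalty to 1.0 would give the distance asked for in the question.

```python
DIGRAPHS = ("ij", "gh", "th")  # illustrative only, not a complete inventory

# Hypothetical penalty table: unordered symbol pairs that are cheaper to swap.
SPECIAL_COSTS = {frozenset(("ij", "y")): 0.5}


def tokenize(word, digraphs=DIGRAPHS):
    """Split a word into symbols, treating each listed digraph as one symbol."""
    tokens, i = [], 0
    while i < len(word):
        if word[i:i + 2] in digraphs:
            tokens.append(word[i:i + 2])
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens


def substitution_cost(x, y):
    """0 for identical symbols, a table entry if present, otherwise 1."""
    if x == y:
        return 0.0
    return SPECIAL_COSTS.get(frozenset((x, y)), 1.0)


def weighted_osa(a, b):
    """Optimal string alignment distance over symbols instead of characters."""
    s, t = tokenize(a), tokenize(b)
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = float(i)
    for j in range(len(t) + 1):
        d[0][j] = float(j)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # delete a symbol
                d[i][j - 1] + 1.0,  # insert a symbol
                d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),
            )
            # transposition of two adjacent symbols
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[-1][-1]
```

With these assumed penalties, `weighted_osa("bijna", "byna")` is 0.5, and `weighted_osa("rough", "ruf")` is 2.0 (one deletion plus one substitution of the "gh" symbol).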

This breaks, however, when the "ij" was not intended as a digraph: when it is a misspelling, when it straddles a word-segmentation border (I don't know whether that can happen in Dutch), or in any other situation where it is not actually (meant as) a digraph.

Otherwise you will need to do some lookaround. This will complicate things but should not change the growth order of the algorithm (I believe), provided you only look at a constant number of cells around the current one. The constant factors will still be much bigger, though.
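One way the lookaround could look, sketched on top of the restricted (optimal string alignment) variant and working on the raw characters. The extra step lets a cell reach back two characters in one string and one in the other whenever it sees "ij" opposite "y". The constants `DIGRAPH` and `SINGLE` and the function name are assumptions made for this example, not part of any standard formulation.

```python
DIGRAPH, SINGLE = "ij", "y"  # assumed pair that should count as one edit


def osa_with_ij(a, b):
    """Optimal string alignment distance with an extra "ij" <-> "y" edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                best = min(best, d[i - 2][j - 2] + 1)
            # lookaround: "ij" ending at a[i-1] replaced by "y" at b[j-1]
            if i > 1 and a[i - 2:i] == DIGRAPH and b[j - 1] == SINGLE:
                best = min(best, d[i - 2][j - 1] + 1)
            # lookaround: "y" at a[i-1] replaced by "ij" ending at b[j-1]
            if j > 1 and b[j - 2:j] == DIGRAPH and a[i - 1] == SINGLE:
                best = min(best, d[i - 1][j - 2] + 1)
            d[i][j] = best
    return d[-1][-1]
```

Each cell still inspects only a fixed number of neighbours, so the running time stays proportional to the product of the two string lengths. Here `osa_with_ij("ij", "y")` and `osa_with_ij("y", "ij")` both come out as 1.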

License: CC-BY-SA with attribution

Not affiliated with StackOverflow