Question

I need to evaluate Levenshtein edit distance on Unicode strings, which means that two strings containing canonically equivalent content need to be normalized first so the edit distance isn't biased.

Here is how I generate random unicode strings for my tests:

import random

def random_unicode(length=10):
    # Python 2: unichr/xrange (may fail for code points > 0xFFFF on narrow builds)
    ru = lambda: unichr(random.randint(0, 0x10ffff))
    return ''.join([ru() for _ in xrange(length)])
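(As an aside, a roughly equivalent sketch in Python 3 syntax, where `chr` and `range` replace `unichr` and `xrange`; note that picking code points uniformly at random can produce lone surrogates, U+D800 through U+DFFF, which are valid `str` content but not encodable as UTF-8:)

```python
import random

def random_unicode(length=10):
    # chr() in Python 3 covers the full Unicode range up to U+10FFFF
    return ''.join(chr(random.randint(0, 0x10FFFF)) for _ in range(length))
```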

And here is the simple test case that is failing:

import unicodedata
uni = random_unicode()
unicodedata.normalize(uni, 'NFD')

And here is the error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

I checked to make sure that `uni` was, indeed, a unicode object:

u'\U00020d93\U000fb2e6\U0005709a\U000bc31e\U00080262\U00034f00\U00059941\U0002dd09\U00074f6d\U0009ef7a'

Can someone enlighten me?


Solution

You've swapped the arguments to `normalize`. From the relevant documentation:

unicodedata.normalize(form, unistr)

Return the normal form *form* for the Unicode string *unistr*. Valid values for *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

The first argument is the form, and the second is the string to be normalized. This works just fine:

unicodedata.normalize('NFD', uni)
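To see why this matters for the edit-distance use case in the question, here is a small sketch comparing a precomposed and a decomposed spelling of the same character (written with `u''` literals so it runs under both Python 2 and Python 3.3+):

```python
import unicodedata

composed = u'\u00e9'      # 'é' as a single precomposed code point
decomposed = u'e\u0301'   # 'e' followed by a combining acute accent

# The raw strings differ, which would inflate an edit distance,
# but after NFD normalization they compare equal.
assert composed != decomposed
assert unicodedata.normalize('NFD', composed) == unicodedata.normalize('NFD', decomposed)
```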
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow