Question

Given the accented unicode word like u'кни́га', I need to strip the acute (u'книга'), and also change the accent format to u'кни+га', where '+' represents the acute over the preceding letter.

What I do now is using a dictionary of acuted and not acuted symbols:

accented_list = [u'я́', u'и́', u'ы́', u'у́', u'э́', u'а́', u'е́', u'ю́', u'о́']
regular_list = [u'я', u'и', u'ы', u'у', u'э', u'а', u'е',  u'ю', u'о']
accent_dict = dict(zip(accented_list, regular_list))

I want to do something like this:

def changeAccentFormat(word):
  for letter in accent_dict:
    if letter in word:
      its_index = word.index(letter)
      word = word[:its_index + 1] + u'+' + word[its_index + 1:]
  return word

But of course it does not work as desired. I noticed that this code:

>>> word = u'кни́га'
>>> for letter in word:
...     print letter

gives

к
н
и                                                                                                                                                                                  
´   

г
а

(Well, i didn't expect the blank symbol to appear, but nevertheless). So I wonder, what is the simplest way to produce [u'к', u'н', u'и́', u'г', u'а']? Or maybe there is some way to solve my problem without it?

Was it helpful?

Solution

First of all, in regard to iterating over characters instead bytes, you're already doing it right - your word is an unicode object, not an encoded bytestring.

Now, for combination characters in Unicode:

For many characters containing combination characters there is a composed and decomposed form of writing it, the composed being one code point, and the decomposed a sequence of two (or more?) code points:

Composed and decomposed form of c with cedilla

See U+00E7, U+0063 and U+0327

So in Python, you could either write either form, it will get composed at display time to the same character:

>>> combining_cedilla = u'\u0327'
>>> c_with_cedilla = u'\u00e7'
>>> letter_c = u'\u0063'
>>>
>>> print c_with_cedilla
ç
>>> print letter_c + combining_cedilla
ç

In order to convert between composed and decomposed forms, you can use unicodedata.normalize():

>>> import unicodedata
>>> comp = unicodedata.normalize('NFC', letter_c + combining_cedilla)
>>> decomp = unicodedata.normalize('NFD', c_with_cedilla)
>>>
>>> print comp
ç
>>> print decomp
ç

(NFC stands for "normal form C" (composed), and NFD for "normal form D" (decomposed).

They still are different forms though - one consisting of one code point, the other of two:

>>> comp == decomp
False
>>> len(comp)
1
>>> len(decomp)
2

However, in your case, there simply does not seem to be a combined character for the lowercase и with an accent acute (there is one for и with an accent grave)

OTHER TIPS

You can produce [u'к', u'н', u'и́', u'г', u'а'] with the regex module.

Here is the word you have by each user perceived character:

>>> import regex
>>> word = u'кни́га'
>>> len(word)
6
>>> regex.findall(r'\X', word)
['к', 'н', 'и́', 'г', 'а']
>>> len(regex.findall(r'\X', word))
5

Acutes are represented by codepoint 301, COMBINING ACUTE ACCENT, so a simple string character replacement should suffice:

>>>print u'кни́га'.replace(u'\u0301', "+")
кни+га

If you encounter accented characters that are not encoded with a combining accent, unicodedata.normalize should do the trick

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top