First of all, in regard to iterating over characters instead bytes, you're already doing it right - your word
is an unicode object, not an encoded bytestring.
Now, for combination characters in Unicode:
For many characters containing combination characters there is a composed and decomposed form of writing it, the composed being one code point, and the decomposed a sequence of two (or more?) code points:
So in Python, you could either write either form, it will get composed at display time to the same character:
>>> combining_cedilla = u'\u0327'
>>> c_with_cedilla = u'\u00e7'
>>> letter_c = u'\u0063'
>>>
>>> print c_with_cedilla
ç
>>> print letter_c + combining_cedilla
ç
In order to convert between composed and decomposed forms, you can use unicodedata.normalize()
:
>>> import unicodedata
>>> comp = unicodedata.normalize('NFC', letter_c + combining_cedilla)
>>> decomp = unicodedata.normalize('NFD', c_with_cedilla)
>>>
>>> print comp
ç
>>> print decomp
ç
(NFC
stands for "normal form C" (composed), and NFD
for "normal form D" (decomposed).
They still are different forms though - one consisting of one code point, the other of two:
>>> comp == decomp
False
>>> len(comp)
1
>>> len(decomp)
2
However, in your case, there simply does not seem to be a combined character for the lowercase и
with an accent acute (there is one for и
with an accent grave)