Question

I came across an odd Ukrainian word, Кири́лл. I decoded it to unicode and tested it with isalpha, which returned False. Looking around, I found that the word contains a character named 'combining acute accent' (U+0301). So the letter и́ is actually a sequence of two characters: и and ́. As I understand it, combining marks such as this acute accent exist only to modify other characters, so isalpha should recognize this string as a word. Am I wrong? Is there a way to get the correct result? The word in question, in UTF-8:

word = '\xd0\x9a\xd0\xb8\xd1\x80\xd0\xb8\xcc\x81\xd0\xbb\xd0\xbb'


Solution

I think you will need to strip out any modifier (combining) characters before the check, since a combining character is not itself considered alphabetic and isalpha requires every character in the string to be alphabetic:

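import re

# UTF-8 bytes of the marks to strip; the <...> entries are placeholders for others
modifiers = "\xcc\x81|<OTHER>|<MODIFIERS>"

# my_text is the UTF-8 encoded byte string to check (Python 2)
text_to_analyze = re.sub(modifiers, "", my_text)
print unicode(text_to_analyze, "utf8").isalpha()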
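
As an alternative to listing the mark bytes by hand, here is a minimal sketch that drops every character whose Unicode category is a mark (Mn, Mc, Me) before calling isalpha. It assumes Python 3 and the standard unicodedata module, whereas the snippet above is Python 2. The helper name is_alpha_ignoring_marks is hypothetical and not part of the original answer.

```python
import unicodedata

def is_alpha_ignoring_marks(text):
    # Decompose first so precomposed accented letters are split into base + mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character in a mark category (Mn, Mc, Me).
    stripped = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    # An empty result (for example, input made only of marks) is not alphabetic.
    return stripped.isalpha()

# The word from the question, decoded from its UTF-8 bytes.
word = b'\xd0\x9a\xd0\xb8\xd1\x80\xd0\xb8\xcc\x81\xd0\xbb\xd0\xbb'.decode("utf-8")

print(word.isalpha())                 # False: U+0301 is a combining mark
print(is_alpha_ignoring_marks(word))  # True once the mark is removed
```

Note that Unicode normalization to NFC alone would not help here, because Cyrillic и followed by the combining acute accent has no precomposed form, so the mark has to be removed explicitly.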
Licensed under: CC-BY-SA with attribution