Question

In the online Dive Into Python 3 book, it says that the advantage of UTF-32 and UTF-16 is that:

UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and represents the character with that same number. This has some advantages, the most important being that you can find the Nth character of a string in constant time, because the Nth character starts at the 4×Nth byte

Can somebody explain this, if possible with an example? I am not sure I have quite understood it.


Solution

The most common Unicode encoding is UTF-8, which represents each character with a variable number of bytes (one to four). For instance, the “L” character is encoded with a single byte (0x4C), while “é” is encoded with two bytes (0xC3, 0xA9). So in UTF-8 the word “Lézard” takes 7 bytes, and you cannot locate the Nth character without scanning all the characters before it, because you don't know in advance how many bytes each one needs.
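To make this concrete, here is a minimal Python sketch. The helper names `utf8_len` and `utf8_char_at` are made up for illustration and are not part of any library. It shows that finding a character in UTF-8 bytes means walking over every character that comes before it:

```python
word = "L\u00e9zard"            # "Lézard", with a precomposed é (U+00E9)
utf8 = word.encode("utf-8")

print(len(word))                # number of characters
print(len(utf8))                # number of bytes, larger because é needs two


def utf8_len(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, judged from its first byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def utf8_char_at(data: bytes, n: int) -> str:
    """Return the n-th character (0-based) by scanning from the start."""
    pos = 0
    for _ in range(n):          # must step over every earlier character
        pos += utf8_len(data[pos])
    return data[pos:pos + utf8_len(data[pos])].decode("utf-8")


print(utf8_char_at(utf8, 2))    # third character of the word
```

The loop in `utf8_char_at` is the point: the work grows with N, so lookup is linear rather than constant time.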

In UTF-32, every character uses exactly 4 bytes, so the Nth character (counting from 1) starts at byte 4×(N-1): the first character is at position 0, the second at position 4, the third at position 8, and so on. The book's “4×Nth byte” is the same rule with characters counted from 0.
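The same lookup in UTF-32 can be sketched as follows. The function name `utf32_char_at` is hypothetical. The `utf-32-le` codec is used here because Python's plain `utf-32` codec prepends a 4-byte byte order mark, which would shift every offset by 4:

```python
word = "L\u00e9zard"                  # "Lézard"
utf32 = word.encode("utf-32-le")      # little-endian, no byte order mark

print(len(utf32))                     # 4 bytes for each of the 6 characters


def utf32_char_at(data: bytes, n: int) -> str:
    """Return the n-th character (1-based) by jumping straight to its offset."""
    start = 4 * (n - 1)               # no scanning needed
    return data[start:start + 4].decode("utf-32-le")


print(utf32_char_at(utf32, 2))        # second character of the word
```

There is no loop: the offset is computed with one multiplication, which is what "constant time" means in the quoted passage.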

OTHER TIPS

As Pavel said, “character” has little precise meaning, and its closest equivalents mean different things in different scripts (see: Indic scripts). Even so, it is far easier to count whatever you consider a character in UTF-32, be it a Latin 'A', a Chandrakala, கா, etc., because every code point has the same fixed width. Keep in mind that the fixed width applies per code point: கா, for example, is made of two code points (க followed by the vowel sign ா), so it occupies 8 bytes.
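A small sketch of the code point caveat, assuming Python 3, where `len()` on a `str` counts code points:

```python
ka = "\u0b95"                     # க  TAMIL LETTER KA
kaa = "\u0b95\u0bbe"              # கா, i.e. KA followed by TAMIL VOWEL SIGN AA

print(len("A"), len("A".encode("utf-32-le")))   # code points, bytes
print(len(ka), len(ka.encode("utf-32-le")))
print(len(kaa), len(kaa.encode("utf-32-le")))
```

Each code point takes 4 bytes in UTF-32, but something a reader sees as one character, such as கா, can still span more than one code point.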

Licensed under: CC-BY-SA with attribution