Question

In the online Dive Into Python 3 book, it says that the advantage of UTF-32 and UTF-16 is that

UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and represents the character with that same number. This has some advantages, the most important being that you can find the Nth character of a string in constant time, because the Nth character starts at the 4×Nth byte

Can somebody explain this, if possible with an example? I am not sure I have quite understood it.


Solution

The usual encoding of Unicode is UTF-8, which represents characters with a variable number of bytes. For instance, the “L” character is encoded with a single byte (0x4c), while “é” is encoded with two bytes (0xc3, 0xa9). So in UTF-8 the word “Lézard” takes 7 bytes, and you cannot get the Nth character without decoding all the characters before it, because you don't know in advance how many bytes each one needs.
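To make the variable width concrete, here is a minimal Python sketch. The helper names `char_width` and `nth_char_utf8` are made up for this illustration, the index `n` is 0-based, and the input is assumed to be valid UTF-8. Finding a character means walking over every character before it:

```python
def char_width(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, read from its first byte."""
    if lead < 0x80:
        return 1          # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2          # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3          # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4          # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")


def nth_char_utf8(data: bytes, n: int) -> str:
    """Return character n (0-based) by scanning from the start: O(n) work."""
    pos = 0
    for _ in range(n):
        pos += char_width(data[pos])   # skip one whole character
    return data[pos:pos + char_width(data[pos])].decode("utf-8")


word = "L\u00e9zard"                   # "Lézard" with a precomposed é
utf8 = word.encode("utf-8")

print(len(word))                       # 6 characters
print(len(utf8))                       # 7 bytes
print(nth_char_utf8(utf8, 2))          # z (stored at byte 3, not byte 2)
```

The loop is the cost: the byte offset of a character depends on the widths of all the characters in front of it.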

In UTF-32, every character uses exactly 4 bytes, so to get the Nth character (counting from 1) you only need to go to byte 4×(N-1): the first character is at position 0, the second at position 4, the third at position 8, and so on.
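The same lookup in UTF-32 needs no scanning. This is a small sketch rather than anything from the book: `nth_char_utf32` is a hypothetical helper, it counts from 0 (so the offset is 4×n, as in the book's wording), and it assumes the bytes were produced with the `utf-32-le` codec, which writes no byte order mark. The plain `utf-32` codec prepends a 4-byte BOM, which would shift every offset by 4.

```python
def nth_char_utf32(data: bytes, n: int) -> str:
    """Return character n (0-based) with a single slice: O(1) work."""
    start = 4 * n                      # every character is exactly 4 bytes
    return data[start:start + 4].decode("utf-32-le")


word = "L\u00e9zard"                   # "Lézard" with a precomposed é
utf32 = word.encode("utf-32-le")       # little-endian, no BOM

print(len(utf32))                      # 24 bytes: 6 characters * 4
print(nth_char_utf32(utf32, 2))        # z (stored at bytes 8..11)
```

The price of the constant-time lookup is space: the word takes 24 bytes here versus 7 in UTF-8.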

Other tips

As Pavel said, "character" has little meaning on its own, and its closest equivalents mean different things in different languages (see: Indic scripts). Even so, it is far easier in UTF-32 to count whatever you consider a character to be, whether a Latin 'A', Chandrakala, கா, etc., because every code point has a fixed width.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow