Why does UTF-8 waste several bits in its encoding

https://softwareengineering.stackexchange.com/questions/262227

06-10-2020
|

Pergunta

According to the Wikipedia article, UTF-8 has this format:

First code Last code Bytes Byte 1    Byte 2    Byte 3    Byte 4
point      point     Used
U+0000     U+007F    1     0xxxxxxx
U+0080     U+07FF    2     110xxxxx  10xxxxxx
U+0800     U+FFFF    3     1110xxxx  10xxxxxx  10xxxxxx
U+10000    U+1FFFFF  4     11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
x means that this bit is used to select the code point.

This wastes two bits on each continuation byte and one bit in the first byte. Why is UTF-8 not encoded like the following?

First code Last code Bytes Byte 1    Byte 2    Byte 3
point      point     Used
U+0000     U+007F    1     0xxxxxxx
U+0080     U+3FFF    2     10xxxxxx  xxxxxxxx
U+0800     U+1FFFFF  3     110xxxxx  xxxxxxxx  xxxxxxxx

It would save one byte when the code point is out of the Basic Multilingual Plane or if the code point is in range [U+800,U+3FFF].

Why is UTF-8 not encoded in a more efficient way?

Solução

This is done so that you can detect when you are in the middle of a multi-byte sequence. When looking at UTF-8 data, you know that if you see 10xxxxxx, that you are in the middle of a multibyte character, and should back up in the stream until you see either 0xxxxxx or 11xxxxxx. Using your scheme, bytes 2 or 3 could easily end up with patters like either 0xxxxxxx or 11xxxxxx

Also keep in mind that how much is saved varies entirely on what sort of string data you are encoding. For most text, even Asian text, you will rarely if ever see four byte characters with normal text. Also, people's naive estimates about how text will look are often wrong. I have text localized for UTF-8 that includes Japanese, Chinese and Korean strings, yet it is actually Russian that takes most space. (Because our Asian strings often have Roman characters interspersed for proper names, punctuation and such and because the average Chinese word is 1-3 characters while the average Russian word is many, many more.)

Outras dicas

The official way lets the decoder know when it's in the middle of the tuple and it knows to skip bytes (or go backwards) until the byte starts with 0 or 11; this prevents garbage values when a single byte gets corrupted.

Short answer, your proposal does not differentiate between the first byte and continuation bytes.

The bit pattern at the high end of the first byte tells you with how many bytes the actual character is built. These patterns provide also some error recognition while parsing a string. If you are reading the (seemingly) first byte of a character and you get 10xxxxxx then you know that you are out of synch.

What hasn't been mentioned is that if you have a correct sequence of code points, and a pointer that is guaranteed to point to the first byte of a code point, with UTF-8 you can very easily find the pointer to the first byte of the previous code point (skip all bytes that start with 01xx xxxx). With your encoding, it's impossible without potentially examining all bytes up to the start of the string.

Consider the sequences of (2n + 2) bytes

0xxxxxxx
n times (10xxxxxx, 10xxxxxx)
0xxxxxxx

and

n times (10xxxxxx, 10xxxxxx)
(10xxxxxx, 0xxxxxxx)

If you have a pointer to the first byte of the first code point after this sequence, you must examine all bytes to find out whether the last codepoint is 0xxxxxxx or (10xxxxxx, 0xxxxxxx).

There are actually more efficient encoding schemes, where going to the previous code point can be done in constant time, and pointers to the middle of a code point can be fixed. Allow the following codes:

X where X < 128
YX where 128 ≤ Y < 236, X < 128
ZYY where 236 ≤ Z < 256, 0 ≤ Y < 236.

If one of the previous three bytes is ≥ 236 then it is the start of a 3 byte sequence, because there can be no two such bytes within any valid 3 byte sequence. Otherwise, if one of the previous two bytes is ≥ 128 then it is the start of a two byte sequence. Otherwise, the previous byte is a single byte < 128.

Searching for a substring becomes slightly more difficult. You may want to exclude zero bytes so that a string only contains a zero byte if it contains a zero code point.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a softwareengineering.stackexchange