Question

When you encode a code point to code units in UTF-8, and the code point fits in 7 bits, the most significant bit is set to zero, which tells you that the character is stored in 1 byte (or, more precisely, in 7 bits).

If the code point occupies more than 7 bits, then the number of leading one bits in the first byte tells you how many code units make up that code point. According to the specification, this run of one bits is always followed by a single zero bit that terminates it and thereby separates the length prefix from the code point bits.
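To make the byte layout concrete, here is a minimal Python sketch of the standard 1- to 4-byte rules (for illustration only; it skips the surrogate-range check, and real code would simply use str.encode("utf-8")):

    def utf8_encode(cp: int) -> bytes:
        if cp < 0x80:          # fits in 7 bits -> 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:         # up to 11 bits -> 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6),
                          0x80 | (cp & 0x3F)])
        if cp < 0x10000:       # up to 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        if cp < 0x110000:      # up to 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        raise ValueError("code point out of Unicode range")

    assert utf8_encode(ord("é")) == "é".encode("utf-8")
    assert utf8_encode(0x1F600) == "\U0001F600".encode("utf-8")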

I have two specific questions; please answer them separately.

1) If the first byte makes it crystal clear how many bytes you should read for the code point, why are the first 2 bits of every continuation byte set to “10”? Why are they necessary if you already know exactly how many bytes there are? They seem to waste precious space.

2) The second question is: what are the theoretical limits of UTF-8? For compatibility reasons, UTF-8 will always encode to a maximum of 4 code units. But others say that in theory it could encode code points in up to 7 code units, in which case the first byte contains none of the code point bits: it is 7 one bits followed by the terminating zero. And if we are theorizing anyway, we could say that UTF-8 could encode to an arbitrary number of code units if we did not limit the length indication to the first byte. For example, the 52-bit nonexistent code point 0x8000000000000 could be stored as follows:

1111 1111 - 1100 1000 - 1000 0000 - 1000 0000 - 1000 0000 - 1000 0000 - 1000 0000 - 1000 0000 - 1000 0000 - 1000 0000

This would mean that the character is stored in 10 bytes.
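Purely as an illustration of that hypothetical scheme (which is not real UTF-8), a small Python sketch can decode such a sequence by counting the leading one bits across bytes until the terminating zero, then collecting the remaining bits plus 6 bits from each continuation byte:

    def decode_extended(data: bytes) -> int:
        # Hypothetical scheme from above: the run of leading 1 bits may span bytes.
        length = 0                 # total leading 1 bits = claimed sequence length in bytes
        i = 0
        while True:
            bits = format(data[i], "08b")
            ones = len(bits) - len(bits.lstrip("1"))
            length += ones
            i += 1
            if ones < 8:                      # this byte contains the terminating 0 bit
                payload = bits[ones + 1:]     # bits after the run and its terminator
                break
        for b in data[i:length]:              # remaining bytes are 10xxxxxx continuations
            payload += format(b & 0x3F, "06b")
        return int(payload, 2)

    example = bytes([0b11111111, 0b11001000] + [0b10000000] * 8)  # the 10 bytes above
    assert decode_extended(example) == 0x8000000000000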


Solution

Answer to Question 1: why are the first 2 bits of every continuation byte set to “10”?

It lets you land at a random place in the sequence and unambiguously work back to the beginning of the current code point (or forward to the start of the next one).

If you are starting from the beginning of a sequence, then you know the first byte is a lead byte, and you can work forwards from there - easy. But if you land in the middle, and we didn't force the second-from-top bit of continuation bytes to zero, then we wouldn't be able to distinguish some continuation bytes from a lead byte.

Consider, for example, the two-byte pattern 110xxxxx 10xxxxxx. In a world where we don't force the 0 in byte 2 and instead squeeze 7 bits of useful data into the second byte, 11011111 11011111 could be a legitimate character encoding, and we can no longer tell where the beginning is.

It is self-synchronizing, as @ErikEidt notes in the comments. You can be dropped at a random place in the sequence and back up no more than 3 bytes to find an unambiguous lead byte. You will also never find a shorter code buried inside a longer one. Without these properties, you wouldn't be able to go from a given code point to the previous one in fast constant time (as @gnasher729 highlights in the comments). It also means you can process a UTF-8 string in reverse with about the same cost overhead as working forwards.
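As a small Python sketch of that property (assuming the input is valid UTF-8), finding the start of the code point that contains an arbitrary byte offset is just a matter of skipping backwards over 10xxxxxx bytes:

    def start_of_codepoint(data: bytes, pos: int) -> int:
        # Back up over continuation bytes (10xxxxxx) until we hit a lead byte.
        while pos > 0 and (data[pos] & 0xC0) == 0x80:
            pos -= 1                          # at most 3 steps for valid UTF-8
        return pos

    text = "aé€😀".encode("utf-8")            # 1-, 2-, 3- and 4-byte sequences, 10 bytes total
    # Landing anywhere inside the 4-byte emoji (bytes 6..9) leads back to its first byte.
    assert all(start_of_codepoint(text, i) == 6 for i in range(6, 10))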

Answer to Question 2: what are the theoretical limits of UTF-8?

I don't think I can put it better than Erik has in his comment on the question, which I shall copy here for posterity:

I think your second point is hypothetically possible. In addition, we don't have to keep to the same 11111111 pattern, we don't have to extend it one bit at a time, and we don't have to have that mean extending the data only one byte at a time. All that is technically required is to be able to positively differentiate between forms and there's lots of approaches for that, plus, different approaches can be mixed within a single code. – Erik Eidt

Licensed under: CC-BY-SA with attribution