Question

Okay, I know this looks like the typical "Why didn't you just Google it or go to www.unicode.org and look it up?" question, but for such a simple question the answer still eludes me after checking both sources.

I'm fairly sure all three of these encoding systems support all of the Unicode characters, but I need to confirm that before I make the claim in a presentation.

Bonus question: Do these encodings differ in the number of characters they can be extended to support?

Solution

No, they are simply different encoding methods. They all support encoding the same set of characters.

UTF-8 uses one to four bytes per character, depending on which character it is encoding. Characters within the ASCII range take only one byte, while very unusual characters take four.

UTF-32 uses four bytes per character regardless of which character it is, so it will always use more space than UTF-8 to encode the same string. Its only advantage is that you can compute the number of characters in a UTF-32 string just by counting bytes (and dividing by four).

UTF-16 uses two bytes for most characters and four bytes for the unusual ones.

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
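To make those size differences concrete, here is a minimal Python sketch (the sample string is just an illustrative assumption):

    # Compare how many bytes the same text needs in each encoding.
    # The "-le" variants are used so no byte-order mark is added.
    text = "A€😀"  # ASCII letter, Euro sign (U+20AC), emoji (U+1F600)

    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        data = text.encode(enc)
        print(f"{enc}: {len(data)} bytes")

    # Expected output:
    #   utf-8: 8 bytes      (1 + 3 + 4)
    #   utf-16-le: 8 bytes  (2 + 2 + 4, the emoji needs a surrogate pair)
    #   utf-32-le: 12 bytes (4 + 4 + 4)
    # A UTF-32 string's character count is simply its byte length divided by four.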

Other tips

There is no Unicode character that can be stored in one encoding but not another. This is simply because the valid Unicode characters have been restricted to what can be stored in UTF-16 (which has the smallest capacity of the three encodings). In other words, UTF-8 and UTF-32 could be used to represent a wider range of characters than UTF-16, but they aren't. Read on for more details.

UTF-8

UTF-8 is a variable-length code. Some characters require 1 byte, some require 2, some 3 and some 4. The bytes for each character are simply written one after another as a continuous stream of bytes.

While some UTF-8 characters can be 4 bytes long, UTF-8 cannot encode 2^32 characters. It's not even close. I'll try to explain the reasons for this.

The software that reads a UTF-8 stream just gets a sequence of bytes - how is it supposed to decide whether the next 4 bytes are a single 4-byte character, or two 2-byte characters, or four 1-byte characters (or some other combination)? Basically this is done by deciding that certain 1-byte sequences aren't valid characters, and certain 2-byte sequences aren't valid characters, and so on. When these invalid sequences appear, it is assumed that they form part of a longer sequence.
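As a rough sketch of that rule (not a full validator; the helper name is made up here), the first byte of a UTF-8 sequence already tells the decoder how many bytes belong to the character:

    def utf8_sequence_length(first_byte: int) -> int:
        """Length of a UTF-8 sequence, judged from its first byte alone."""
        if first_byte < 0x80:            # 0xxxxxxx: a 1-byte (ASCII) character
            return 1
        if 0xC0 <= first_byte < 0xE0:    # 110xxxxx: starts a 2-byte character
            return 2
        if 0xE0 <= first_byte < 0xF0:    # 1110xxxx: starts a 3-byte character
            return 3
        if 0xF0 <= first_byte < 0xF8:    # 11110xxx: starts a 4-byte character
            return 4
        raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")

    for ch in "A€😀":
        lead = ch.encode("utf-8")[0]
        print(ch, hex(lead), utf8_sequence_length(lead))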

You've seen a rather different example of this, I'm sure: it's called escaping. In many programming languages it is decided that the \ character in a string's source code doesn't translate to any valid character in the string's "compiled" form. When a \ is found in the source, it is assumed to be part of a longer sequence, like \n or \xFF. Note that \x is an invalid 2-character sequence, and \xF is an invalid 3-character sequence, but \xFF is a valid 4-character sequence.

Basically, there's a trade-off between having many characters and having shorter characters. If you want 2^32 characters, they need to be on average 4 bytes long. If you want all your characters to be 2 bytes or less, then you can't have more than 2^16 characters. UTF-8 gives a reasonable compromise: all ASCII characters (ASCII 0 to 127) are given 1-byte representations, which is great for compatibility, but many more characters are allowed.

Like most variable-length encodings, including the kinds of escape sequences shown above, UTF-8 is an instantaneous code. This means that the decoder just reads byte by byte, and as soon as it reaches the last byte of a character, it knows what the character is (and that it isn't the beginning of a longer character).

For instance, the character 'A' is represented using the byte 65, and there are no two/three/four-byte characters whose first byte is 65. Otherwise the decoder wouldn't be able to tell those characters apart from an 'A' followed by something else.

But UTF-8 is restricted even further. It ensures that the encoding of a shorter character never appears anywhere within the encoding of a longer character. For instance, none of the bytes in a 4-byte character can be 65.

Since UTF-8 has 128 different 1-byte characters (whose byte values are 0-127), all 2, 3 and 4-byte characters must be composed solely of bytes in the range 128-255. That's a big restriction. However, it allows byte-oriented string functions to work with little or no modification. For instance, C's strstr() function always works as expected if its inputs are valid UTF-8 strings.
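Here is a small Python sketch of both properties, using byte strings in place of C's strstr(); the sample text is an arbitrary assumption:

    # Every byte of a multi-byte UTF-8 character is >= 0x80, so an ASCII byte
    # such as 65 ('A') can never occur inside one.
    emoji = "😀".encode("utf-8")                  # b'\xf0\x9f\x98\x80'
    print([hex(b) for b in emoji])
    print(all(b >= 0x80 for b in emoji))          # True

    # That is why a plain byte-level substring search (what strstr() does)
    # cannot produce false matches on valid UTF-8.
    haystack = "naïve café 😀".encode("utf-8")
    needle = "café".encode("utf-8")
    print(needle in haystack)                     # True, found at the right place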

UTF-16

UTF-16 is also a variable-length code; its characters consume either 2 or 4 bytes. Two-byte values in the range 0xD800-0xDFFF are reserved for constructing 4-byte characters, and every 4-byte character consists of a two-byte value in the range 0xD800-0xDBFF (a high surrogate) followed by a two-byte value in the range 0xDC00-0xDFFF (a low surrogate). For this reason, Unicode does not assign any characters in the range U+D800-U+DFFF.
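A short sketch of that construction (the helper name is made up; U+1F600 is just an assumed example):

    def to_surrogates(cp: int):
        """Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair."""
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000
        high = 0xD800 + (cp >> 10)     # high surrogate: 0xD800-0xDBFF
        low = 0xDC00 + (cp & 0x3FF)    # low surrogate:  0xDC00-0xDFFF
        return high, low

    print([hex(u) for u in to_surrogates(0x1F600)])  # ['0xd83d', '0xde00']
    print("😀".encode("utf-16-be").hex())             # 'd83dde00', the same pair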

UTF-32

UTF-32 is a fixed-length code, with each character being 4 bytes long. While this allows the encoding of 2^32 different characters, only values between 0 and 0x10FFFF are allowed in this scheme.
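In effect, each UTF-32 unit is just the code point value stored in four bytes. A quick sketch (the struct-based decoding is only one way to show the raw values):

    import struct

    data = "A😀".encode("utf-32-le")                 # little-endian, no byte-order mark
    count = len(data) // 4                           # character count from byte length
    code_points = struct.unpack("<%dI" % count, data)
    print(count, [hex(cp) for cp in code_points])    # 2 ['0x41', '0x1f600']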

Capacity comparison:

  • UTF-8: 2,097,152 (actually 2,166,912, but due to design details some of them are overlong encodings of the same characters)
  • UTF-16: 1,112,064
  • UTF-32: 4,294,967,296 (but restricted to the first 1,114,112)

The most restricted is therefore UTF-16! The formal Unicode definition has limited the Unicode characters to those that can be encoded with UTF-16 (i.e. the range U+0000 to U+10FFFF excluding U+D800 to U+DFFF). UTF-8 and UTF-32 support all of these characters.
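For reference, the arithmetic behind those capacity figures can be checked with a few lines (a sketch; it counts code points, handling the surrogate range as noted):

    surrogates = 0xE000 - 0xD800                # 2048 code points reserved for surrogates
    utf16_capacity = (0x10000 - surrogates) + 1024 * 1024
    print(utf16_capacity)                       # 1112064

    unicode_range = 0x110000                    # U+0000 .. U+10FFFF
    print(unicode_range)                        # 1114112
    print(unicode_range - surrogates)           # 1112064 usable scalar values

    utf8_4byte_payload = 2 ** (3 + 3 * 6)       # 3 bits in the lead byte + 6 per continuation byte
    print(utf8_4byte_payload)                   # 2097152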

The UTF-8 system is in fact "artificially" limited to 4 bytes. It can be extended to 8 bytes without violating the restrictions I outlined earlier, and this would yield a capacity of 2^42. The original UTF-8 specification in fact allowed up to 6 bytes, which gives a capacity of 2^31. But RFC 3629 limited it to 4 bytes, since that is how much is needed to cover all of what UTF-16 does.
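The same payload-bit arithmetic for those longer forms (the 8-byte form is purely hypothetical):

    print(2 ** (1 + 5 * 6))   # 2147483648    = 2^31 for the original 6-byte scheme
    print(2 ** (0 + 7 * 6))   # 4398046511104 = 2^42 for a hypothetical 8-byte extension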

There are other (mainly historical) Unicode encoding schemes, notably UCS-2 (which is only capable of encoding U+0000 to U+FFFF).

UTF-8, UTF-16, and UTF-32 all support the full set of Unicode code points. There are no characters that are supported by one but not another.

As for the bonus question "Do these encodings differ in the number of characters they can be extended to support?" Yes and no. The way UTF-8 and UTF-16 are encoded limits the total number of code points they can support to less than 2^32. However, the Unicode Consortium will not add code points to UTF-32 that cannot be represented in UTF-8 or UTF-16. Doing so would violate the spirit of the encoding standards, and make it impossible to guarantee a one-to-one mapping from UTF-32 to UTF-8 (or UTF-16).

I personally always check Joel's post about Unicode, encodings and character sets when in doubt.

All of the UTF-8/16/32 encodings can map all Unicode characters. See Wikipedia's Comparison of Unicode Encodings.

This IBM article, Encode your XML documents in UTF-8, is very helpful and suggests that, if you have the choice, it's better to choose UTF-8. The main reasons are wide tool support and the fact that UTF-8 can usually pass through systems that are unaware of Unicode.

From the section What the specs say in the IBM article:

Both the W3C and the IETF have recently become more adamant about choosing UTF-8 first, last, and sometimes only. The W3C Character Model for the World Wide Web 1.0: Fundamentals states, "When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. US-ASCII is upwards-compatible with UTF-8 (an US-ASCII string is also a UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired." In practice, compatibility with US-ASCII is so useful it's almost a requirement. The W3C wisely explains, "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes."

As everyone has said, UTF-8, UTF-16, and UTF-32 can all encode all of the Unicode code points. However, the UCS-2 (sometimes mistakenly referred to as UCS-16) variant can't, and this is the one that you find e.g. in Windows XP/Vista.

See Wikipedia for more information.

Edit: I was wrong about Windows; NT was the only one to support UCS-2. However, many Windows applications still assume a single word per code point, as in UCS-2, so you are likely to find bugs. See another Wikipedia article. (Thanks JasonTrue)
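A tiny sketch of the kind of mismatch such UCS-2 assumptions cause: Python counts code points, while a UCS-2-minded API counts 16-bit units:

    s = "😀"                                     # one code point above U+FFFF
    print(len(s))                                # 1 code point
    print(len(s.encode("utf-16-le")) // 2)       # 2 UTF-16 code units (what a UCS-2 view sees)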

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow