UTF-8, UTF-16, and UTF-32

https://stackoverflow.com/questions/496321

20-08-2019
|

Question

What are the differences between UTF-8, UTF-16, and UTF-32?

I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?

Solution

UTF-8 has an advantage in the case where ASCII characters represent the majority of characters in a block of text, because UTF-8 encodes all characters into 8 bits (like ASCII). It is also advantageous in that a UTF-8 file containing only ASCII characters has the same encoding as an ASCII file.

UTF-16 is better where ASCII is not predominant, since it uses 2 bytes per character, primarily. UTF-8 will start to use 3 or more bytes for the higher order characters where UTF-16 remains at just 2 bytes for most characters.

UTF-32 will cover all possible characters in 4 bytes. This makes it pretty bloated. I can't think of any advantage to using it.

OTHER TIPS

In short:

UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text.
UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text.
UTF-32: Fixed-width encoding. All code points take four bytes. An enormous memory hog, but fast to operate on. Rarely used.

In long: see Wikipedia: UTF-8, UTF-16, and UTF-32.

UTF-8 is variable 1 to 4 bytes.
UTF-16 is variable 2 or 4 bytes.
UTF-32 is fixed 4 bytes.

Unicode defines a single huge character set, assigning one unique integer value to every graphical symbol (that is a major simplification, and isn't actually true, but it's close enough for the purposes of this question). UTF-8/16/32 are simply different ways to encode this.

In brief, UTF-32 uses 32-bit values for each character. That allows them to use a fixed-width code for every character.

UTF-16 uses 16-bit by default, but that only gives you 65k possible characters, which is nowhere near enough for the full Unicode set. So some characters use pairs of 16-bit values.

And UTF-8 uses 8-bit values by default, which means that the 127 first values are fixed-width single-byte characters (the most significant bit is used to signify that this is the start of a multi-byte sequence, leaving 7 bits for the actual character value). All other characters are encoded as sequences of up to 4 bytes (if memory serves).

And that leads us to the advantages. Any ASCII-character is directly compatible with UTF-8, so for upgrading legacy apps, UTF-8 is a common and obvious choice. In almost all cases, it will also use the least memory. On the other hand, you can't make any guarantees about the width of a character. It may be 1, 2, 3 or 4 characters wide, which makes string manipulation difficult.

UTF-32 is opposite, it uses the most memory (each character is a fixed 4 bytes wide), but on the other hand, you know that every character has this precise length, so string manipulation becomes far simpler. You can compute the number of characters in a string simply from the length in bytes of the string. You can't do that with UTF-8.

UTF-16 is a compromise. It lets most characters fit into a fixed-width 16-bit value. So as long as you don't have Chinese symbols, musical notes or some others, you can assume that each character is 16 bits wide. It uses less memory than UTF-32. But it is in some ways "the worst of both worlds". It almost always uses more memory than UTF-8, and it still doesn't avoid the problem that plagues UTF-8 (variable-length characters).

Finally, it's often helpful to just go with what the platform supports. Windows uses UTF-16 internally, so on Windows, that is the obvious choice.

Linux varies a bit, but they generally use UTF-8 for everything that is Unicode-compliant.

So short answer: All three encodings can encode the same character set, but they represent each character as different byte sequences.

Unicode is a standard and about UTF-x you can think as a technical implementation for some practical purposes:

UTF-8 - "size optimized": best suited for Latin character based data (or ASCII), it takes only 1 byte per character but the size grows accordingly symbol variety (and in worst case could grow up to 6 bytes per character)
UTF-16 - "balance": it takes minimum 2 bytes per character which is enough for existing set of the mainstream languages with having fixed size on it to ease character handling (but size is still variable and can grow up to 4 bytes per character)
UTF-32 - "performance": allows using of simple algorithms as result of fixed size characters (4 bytes) but with memory disadvantage

I tried to give a simple explanation in my blogpost.

UTF-32

requires 32 bits (4 bytes) to encode any character. For example, in order to represent the "A" character code-point using this scheme, you'll need to write 65 in 32-bit binary number:

00000000 00000000 00000000 01000001 (Big Endian)

If you take a closer look, you'll note that the most-right seven bits are actually the same bits when using the ASCII scheme. But since UTF-32 is fixed width scheme, we must attach three additional bytes. Meaning that if we have two files that only contain the "A" character, one is ASCII-encoded and the other is UTF-32 encoded, their size will be 1 byte and 4 bytes correspondingly.

UTF-16

Many people think that as UTF-32 uses fixed width 32 bit to represent a code-point, UTF-16 is fixed width 16 bits. WRONG!

In UTF-16 the code point maybe represented either in 16 bits, OR 32 bits. So this scheme is variable length encoding system. What is the advantage over the UTF-32? At least for ASCII, the size of files won't be 4 times the original (but still twice), so we're still not ASCII backward compatible.

Since 7-bits are enough to represent the "A" character, we can now use 2 bytes instead of 4 like the UTF-32. It'll look like:

00000000 01000001

UTF-8

You guessed right.. In UTF-8 the code point maybe represented using either 32, 16, 24 or 8 bits, and as the UTF-16 system, this one is also variable length encoding system.

Finally we can represent "A" in the same way we represent it using ASCII encoding system:

01001101

A small example where UTF-16 is actually better than UTF-8:

Consider the Chinese letter "語" - its UTF-8 encoding is:

11101000 10101010 10011110

While its UTF-16 encoding is shorter:

10001010 10011110

In order to understand the representation and how it's interpreted, visit the original post.

UTF-8

has no concept of byte-order
uses between 1 and 4 bytes per character
ASCII is a compatible subset of encoding
completely self-synchronizing e.g. a dropped byte from anywhere in a stream will corrupt at most a single character
pretty much all European languages are encoded in two bytes or less per character

UTF-16

must be parsed with known byte-order or reading a byte-order-mark (BOM)
uses either 2 or 4 bytes per character

UTF-32

every character is 4 bytes
must be parsed with known byte-order or reading a byte-order-mark (BOM)

UTF-8 is going to be the most space efficient unless a majority of the characters are from the CJK (Chinese, Japanese, and Korean) character space.

UTF-32 is best for random access by character offset into a byte-array.

I made some tests to compare database performance between UTF-8 and UTF-16 in MySQL.

Update Speeds

UTF-8

Enter image description here

UTF-16

Enter image description here

Insert Speeds

Enter image description here

Delete Speeds

Enter image description here

In UTF-32 all of characters are coded with 32 bits. The advantage is that you can easily calculate the length of the string. The disadvantage is that for each ASCII characters you waste an extra three bytes.

In UTF-8 characters have variable length, ASCII characters are coded in one byte (eight bits), most western special characters are coded either in two bytes or three bytes (for example € is three bytes), and more exotic characters can take up to four bytes. Clear disadvantage is, that a priori you cannot calculate string's length. But it's takes lot less bytes to code Latin (English) alphabet text, compared to UTF-32.

UTF-16 is also variable length. Characters are coded either in two bytes or four bytes. I really don't see the point. It has disadvantage of being variable length, but hasn't got the advantage of saving as much space as UTF-8.

Of those three, clearly UTF-8 is the most widely spread.

Depending on your development environment you may not even have the choice what encoding your string data type will use internally.

But for storing and exchanging data I would always use UTF-8, if you have the choice. If you have mostly ASCII data this will give you the smallest amount of data to transfer, while still being able to encode everything. Optimizing for the least I/O is the way to go on modern machines.

As mentioned, the difference is primarily the size of the underlying variables, which in each case get larger to allow more characters to be represented.

However, fonts, encoding and things are wickedly complicated (unnecessarily?), so a big link is needed to fill in more detail:

http://www.cs.tut.fi/~jkorpela/chars.html#ascii

Don't expect to understand it all, but if you don't want to have problems later it's worth learning as much as you can, as early as you can (or just getting someone else to sort it out for you).

Paul.

In short, the only reason to use UTF-16 or UTF-32 is to support non-English and ancient scripts respectively.

I was wondering why anyone would chose to have non-UTF-8 encoding when it is obviously more efficient for web/programming purposes.

A common misconception - the suffixed number is NOT an indication of its capability. They all support the complete Unicode, just that UTF-8 can handle ASCII with a single byte, so is MORE efficient/less corruptible to the CPU and over the internet.

Some good reading: http://www.personal.psu.edu/ejp10/blogs/gotunicode/2007/10/which_utf_do_i_use.html and http://utf8everywhere.org

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow