Question

I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using (I use both .Net and C/C++, and I need this answer for both Unix and Windows).

Solution

Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

EDIT 20140523: Also, watch Characters, Symbols and the Unicode Miracle by Tom Scott on YouTube - it's just under ten minutes, and a wonderful explanation of the brilliant 'hack' that is UTF-8

OTHER TIPS

A character encoding is a scheme in which each code value looks up a symbol from a given character set. Please see this good Wikipedia article on character encoding.

UTF-8 uses 1 to 4 bytes for each symbol. Wikipedia gives a good rundown of how the multi-byte encoding works:

  • The most significant bit of a single-byte character is always 0.
  • The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences; 1110 for three-byte sequences, and so on.
  • The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
  • A UTF-8 stream contains neither the byte FE nor FF. This ensures that a UTF-8 stream never looks like a UTF-16 stream starting with U+FEFF (the byte-order mark).

The page also shows you a great comparison between the advantages and disadvantages of each character encoding type.
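
To make those bit patterns concrete, here is a minimal sketch in C++ (a hypothetical helper, not from any library) that reads the length of a UTF-8 sequence from its first byte:

```cpp
#include <iostream>

// Hypothetical helper: determine the length of a UTF-8 sequence from its
// lead byte, following the bit patterns listed above.
int utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: single byte (ASCII)
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: two-byte sequence
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: three-byte sequence
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: four-byte sequence
    return -1;                            // 10xxxxxx continuation byte, or invalid FE/FF
}

int main() {
    // "é" (U+00E9) is encoded in UTF-8 as the two bytes 0xC3 0xA9.
    std::cout << utf8_sequence_length(0xC3) << '\n';  // prints 2 (lead byte)
    std::cout << utf8_sequence_length(0xA9) << '\n';  // prints -1 (continuation byte)
}
```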

UTF-16

Uses 2 or 4 bytes for each symbol; code points outside the Basic Multilingual Plane are encoded as two 16-bit units (a surrogate pair).
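
As a minimal sketch of how that works (hypothetical helper name, illustration only), a code point above U+FFFF is split into a surrogate pair like this:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper: encode a Unicode code point as UTF-16 code units.
// Code points up to U+FFFF take one 16-bit unit; larger ones take a pair.
int encode_utf16(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {
        out[0] = static_cast<uint16_t>(cp);
        return 1;
    }
    cp -= 0x10000;                                           // 20 bits remain
    out[0] = static_cast<uint16_t>(0xD800 + (cp >> 10));     // high surrogate
    out[1] = static_cast<uint16_t>(0xDC00 + (cp & 0x3FF));   // low surrogate
    return 2;
}

int main() {
    uint16_t units[2];
    int n = encode_utf16(0x1F600, units);   // U+1F600, outside the BMP
    for (int i = 0; i < n; ++i) std::printf("%04X ", units[i]);  // D83D DE00
    std::printf("\n");
}
```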

UTF-32 (UCS-4)

Always uses 4 bytes for each symbol.

char just means a byte of data and is not an encoding in itself. It is not analogous to UTF-8/UTF-16/ASCII. A char* pointer can refer to any type of data in any encoding.

STL:

Neither std::string nor std::wstring is designed for variable-length character encodings like UTF-8 and UTF-16; their length and indexing operations work on code units (bytes or wchar_t elements), not on characters.
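
A minimal illustration of the pitfall, assuming the string literal holds UTF-8 bytes:

```cpp
#include <string>
#include <iostream>

int main() {
    // std::string stores bytes, not characters.
    // "åäö" is three characters but six bytes in UTF-8 (two bytes each).
    std::string s = "\xC3\xA5\xC3\xA4\xC3\xB6";   // UTF-8 for "åäö"
    std::cout << s.size() << '\n';                // prints 6, not 3
    // s[0] is only the first byte of "å", not the character itself.
}
```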

How to implement:

Take a look at the iconv library. iconv is a powerful character-encoding conversion library used by projects such as libxml (the XML C parser from GNOME).
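
As a rough sketch of what using it looks like (POSIX iconv API, error handling abbreviated; on some platforms the input-buffer parameter is declared const char**), converting UTF-8 to UTF-16LE might look like this:

```cpp
#include <iconv.h>
#include <cstdio>
#include <cstring>

int main() {
    // Open a converter from UTF-8 to UTF-16LE.
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1) { std::perror("iconv_open"); return 1; }

    char in[] = "h\xC3\xA9llo";            // UTF-8 input ("héllo"), 6 bytes
    char out[64];
    char* inp = in;
    char* outp = out;
    size_t inleft = std::strlen(in), outleft = sizeof out;

    // iconv advances the pointers and decrements the byte counters.
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        std::perror("iconv");

    std::printf("%zu UTF-8 bytes became %zu UTF-16 bytes\n",
                std::strlen(in), sizeof out - outleft);   // 6 -> 10

    iconv_close(cd);
    return 0;
}
```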

Other great resources on character encoding:

Received wisdom suggests that Spolsky's article misses a couple of important points.

This article is recommended as being more complete: The Unicode® Standard: A Technical Introduction

This article is also a good introduction: Unicode Basics

The latter in particular gives an overview of the character encoding forms and schemes for Unicode.

The various UTF standards are ways to encode "code points". A code point is an index into the Unicode character set.

Another encoding is UCS-2, which always uses 16 bits per code point and therefore cannot represent the full Unicode range.

It is also good to know that one code point does not necessarily equal one character. For example, a character such as å can be represented either as a single code point or as two code points: one for the a and one for the combining ring.

Comparing two Unicode strings therefore requires normalizing them to a canonical representation before the comparison.
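
A small example of why that matters: the two representations of å have different bytes, so a naive byte-wise comparison reports them as unequal:

```cpp
#include <string>
#include <iostream>

int main() {
    // Both strings display as "å", but their byte sequences differ.
    std::string precomposed = "\xC3\xA5";       // U+00E5 (å as one code point)
    std::string decomposed  = "a\xCC\x8A";      // U+0061 + U+030A (a + combining ring)

    std::cout << std::boolalpha
              << (precomposed == decomposed) << '\n';   // prints false
    // Normalizing both to NFC (e.g. with a library such as ICU) would make
    // them compare equal.
}
```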

There is also the issue of fonts. There are two ways to handle fonts: either you use a gigantic font with glyphs for all the Unicode characters you need (I think recent versions of Windows come with one or two such fonts), or you use a library capable of combining glyphs from various fonts dedicated to subsets of the Unicode standard.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow