Question

Today I woke up and felt something was awfully wrong with my code and every library I've ever used, and I think I was right... (or please point out where my reasoning is wrong)

Let's start a decade or two back in time, when all was well in the world. I spoke to my neighbor and he spoke the same language: just plain English. To me, my neighbor and Windows it seemed obvious to store our strings in 8-bit chars, because all the characters we used fit in the 2^8 = 256 available combinations.

Then the miraculous Internet came along and allowed me to speak to some friends in Europe (who had no time to learn English). This got difficult with our char format: the number of characters in use easily exceeded 256, so in our utterly simplistic vision we decided to use the 16-bit wchar_t and something called UCS-2 Unicode. It has 2^16 = 65,536 available combinations, and that must be enough for every language in the world! Convinced of our correctness, we even added 16-bit Windows API W functions like MessageBoxW and CreateWindowW. We convinced every programmer of our religion, discouraged the use of the evil 8-bit counterparts (MessageBoxA and CreateWindowA), and mapped a call to MessageBox automatically to MessageBoxW by defining UNICODE in our builds (and _UNICODE for the C runtime). Therefore we should also use the wcs functions instead of the old str functions (e.g. strlen should now be wcslen, or use the automatically mapped _tcslen).

Then things got bad: it turned out there were other people in the world who used even weirder glyphs (no offence) than ours: Japanese, Chinese, etc. It got bad because, for example, Chinese has over 70,000 different characters. A lot of swearing occurred and left us with a new flavour of Unicode: UTF-16. It also uses a 16-bit data type, but some characters require two 16-bit values (called a surrogate pair), which means we can't index these 16-bit strings directly (e.g. theString[4] may not return the 5th character). To patch the Windows API it was decided that all W functions should now accept UTF-16; it was an easy decision, since all old UCS-2 strings are valid UTF-16 strings as well. However, we brave programmers still use the wcs functions. Sadly, these functions are not surrogate-aware and still conform to the UCS-2 format...

In the meantime, in a dark attic, another, more compact form of Unicode was developed: UTF-8. Using an 8-bit data type, most Western languages can be stored in a single 8-bit value, just like in the old days. When a more exotic glyph is stored, multiple 8-bit values are used; for most European languages 2 will suffice, but it may expand up to 4 of these values, essentially creating a 32-bit storage type. Just like its fat brother UTF-16, we cannot index these strings directly. Because of its more compact format, UTF-8 is now used almost everywhere on the Internet, since it saves bandwidth.

Good, you made it through my lengthy write-up :) Now I have some questions / points of interest:

  1. Okay, I'm pretty satisfied with using UTF-8 for storage. When I read a file (from disk or an HTTP response) I detect the UTF-8 signature "\xEF\xBB\xBF" and put the contents through MultiByteToWideChar, which leaves me with a UTF-16 string I can use with the W API functions, no problem (a minimal sketch of this read-and-convert step follows after this list). But now I want to modify the string, replace some characters, etc. The good old wcs functions are no good anymore, so which core string functions are UTF-16 aware? Or is there some splendid library out there I don't know of? Edit: It seems ICU is a pretty good solution. I also found that the wcs functions are not completely useless; you can, for instance, still use wcsstr to search, since it essentially just compares wchar_ts. The only remaining problem is the length of the string.

  2. Don't you have the feeling an ugly mistake was made when we were forced into using the deficient 16-bit W functions? Shouldn't the problem have been recognized at a much earlier stage, letting all the original API functions take UTF-8 strings and incorporating proper string manipulation routines? Or is that already possible and am I horribly mistaken? Edit: Maybe this was a silly question; hindsight is indeed wonderful, no use in putting anyone down right now ;)

  3. For fast index access to the characters, we should store strings in 32-bit values. Is this common? (I can hear you thinking: and then we hit an extraterrestrial language requiring more combinations and the fun starts all over again...) The downside of this approach seems to be that we have to convert the string back to UTF-16 every time we make a Windows API call. Edit: Just to quote Alf P. Steinbach, one character per index is a hopeless dream, I see that now. One thing I completely missed out on was the diacritics. I also think it is a good thing to process in the OS's native encoding (for Windows, UTF-16). Although UTF-8 would have been a better choice, we're stuck with UTF-16 now; no point in converting back and forth between your code and the API. As suggested below, I will keep track of parts of a string myself with pointers instead of a character count.
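For reference, here is a minimal sketch of the read-and-convert step from point 1. It assumes the input really is valid UTF-8; error handling is omitted and the helper name is just for illustration:

    // Strip the optional UTF-8 signature, then convert to UTF-16 for the W API.
    #include <windows.h>
    #include <string>

    std::wstring Utf8ToUtf16(const std::string& utf8)
    {
        // Skip the UTF-8 signature "\xEF\xBB\xBF" if it is present.
        size_t offset = 0;
        if (utf8.size() >= 3 && utf8.compare(0, 3, "\xEF\xBB\xBF") == 0)
            offset = 3;

        const char* src = utf8.data() + offset;
        int srcLen = static_cast<int>(utf8.size() - offset);
        if (srcLen == 0)
            return std::wstring();

        // First call: ask how many UTF-16 code units are needed.
        int wideLen = MultiByteToWideChar(CP_UTF8, 0, src, srcLen, nullptr, 0);

        // Second call: perform the actual conversion.
        std::wstring wide(wideLen, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, src, srcLen, &wide[0], wideLen);
        return wide;
    }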

I think you've earned yourself a fine cup of tea for struggling through this lengthy question; go get one before you answer ;)

Edit: I accept the fact that my question has been closed; this would be a better fit for a blog post, but then again, I don't write a blog. I think this character encoding business is essential and should be the next topic in any programming book right after the simple hello world example! Posting it here draws the attention of many experts; those people don't read just any random blog, and I highly value their opinion. So thanks everyone for contributing.

Solution

By strong preference, you should translate from UTF-* to UCS-4 as you read the data. All your processing should be done on UCS-4, and then (if necessary) translate back to UTF-* during output.

That still doesn't fix everything, though. There's a set of "combining diacritical" marks, which means that even when you use UCS-4, string[N] doesn't necessarily correspond to the Nth character of the string. There are transformations to canonical forms that attempt to help with that, but they can't always do the job, so if it's really critical (for your application), you just about need to walk through the string, divide it into units that each represent a complete character (base character + any combining diacriticals), and treat each of those as a unit.
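To make that concrete, here is a minimal C++11 sketch; the code points are real, everything else is just illustration. The precomposed character é is one UCS-4 value, while the visually identical e followed by a combining acute accent is two:

    #include <cstdio>
    #include <string>

    int main()
    {
        // U+00E9: LATIN SMALL LETTER E WITH ACUTE (precomposed form).
        std::u32string precomposed = U"\u00E9";
        // U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT.
        std::u32string decomposed  = U"e\u0301";

        // Both render as "é", yet they differ in length and content, so even
        // indexing by code point does not give you "the Nth character".
        std::printf("precomposed: %zu code point(s)\n", precomposed.size()); // 1
        std::printf("decomposed:  %zu code point(s)\n", decomposed.size());  // 2
    }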

OTHER TIPS

  1. ICU is an excellent Unicode string library. The general concept with string handling is to parse any external forms into memory such that each value is a complete code point, not some part of one as with UTF-16 and UTF-8. Then, after any processing, on the way out of the program, serialise the string back to a suitable transformation format (a short ICU sketch follows after this list). Although the basics are easy, try not to roll your own Unicode library -- things like collation, searching and other complicated matters are best left to a mature library.

  2. Planes outside the BMP weren't used or even defined, because no need for them was seen at the time. Of course, as you have pointed out, there certainly is a need.

  3. Yes, this is common, and as mentioned, is the best way to do things as it improves almost all string operations greatly.
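A minimal sketch of the parse, process, serialise round trip with ICU (link against the ICU common library). Note that icu::UnicodeString stores UTF-16 internally, but it is surrogate-aware and exposes whole code points through methods such as countChar32; the manipulation shown is an arbitrary example:

    #include <unicode/unistr.h>
    #include <iostream>
    #include <string>

    int main()
    {
        // External form: a UTF-8 byte string ("naïve café").
        std::string utf8 = "na\xC3\xAFve caf\xC3\xA9";

        // Parse the external UTF-8 into ICU's internal representation.
        icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);

        // Process: an arbitrary, locale-aware manipulation.
        s.toUpper();
        std::cout << "code points: " << s.countChar32() << "\n";

        // Serialise back to UTF-8 on the way out of the program.
        std::string out;
        s.toUTF8String(out);
        std::cout << out << "\n";   // NAÏVE CAFÉ
    }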

My take on the matter:

  • For the external interface (files, command line arguments, environment variables, stdin/out) use UTF-8, because those are byte streams and the C and C++ languages are designed around interfacing with the environment via byte streams. On most sensible filesystems, file names are (null-terminated) byte strings, too.

  • For simple parroting back, you can keep strings in UTF-8 internally as well, using char* etc., and plain "" string literals or the new u8"" UTF-8 literals.

  • For textual manipulation, convert the string into UCS-4/UTF-32 internally and treat it as an array of char32_t (a minimal decoding sketch follows after this list). That's the only sane way you can speak of a character stream.

  • UTF-16 was a huge mistake and should be shot and shunned. See here (I made a comment there somewhere), and maybe here and here.
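A minimal sketch of that decode-to-char32_t step, assuming well-formed UTF-8 input; real code should validate (reject overlong forms, stray continuation bytes, surrogates) or simply use a library:

    #include <string>

    std::u32string decode_utf8(const std::string& in)
    {
        std::u32string out;
        for (std::size_t i = 0; i < in.size(); )
        {
            unsigned char b = static_cast<unsigned char>(in[i]);
            char32_t cp;
            std::size_t len;
            if      (b < 0x80) { cp = b;        len = 1; }  // 1 byte (ASCII)
            else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 2-byte sequence
            else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 3-byte sequence
            else               { cp = b & 0x07; len = 4; }  // 4-byte sequence

            // Fold in the continuation bytes (6 payload bits each).
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);

            out.push_back(cp);
            i += len;
        }
        return out;
    }

Encoding back to UTF-8 (or to UTF-16 for the Windows API) is the mirror image of this loop.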

  1. ICU — International Components for Unicode. For proper word breaks and display, Windows includes Uniscribe, and non-Windows platforms use FreeType (correct me if I'm wrong).

  2. Yes, I do. But as far as I know, at the time they made that decision UTF-32 did not exist, and they thought 65,536 code points "would be enough for everyone".

  3. No, it's not. Besides quadrupling memory usage, the problem is much worse than you think. You cannot just "modify a string" and "replace some characters", even when using 32-bit values, because one Unicode code point does not necessarily correspond to one written letter or one glyph that you can remove or replace with something else and hope nothing breaks. To work with text properly you will have to use something like ICU anyway, so there is not much difference between using UTF-8 and UTF-32, I think (see the sketch below).
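To illustrate that last point, here is a minimal ICU sketch that counts user-perceived characters (grapheme clusters) with BreakIterator; the input string is just an example mixing a precomposed é with an e plus combining accent:

    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>

    int main()
    {
        UErrorCode status = U_ZERO_ERROR;
        // One precomposed é (U+00E9), then e + U+0301: three code points in total.
        icu::UnicodeString text =
            icu::UnicodeString::fromUTF8("\xC3\xA9" "e\xCC\x81");

        std::unique_ptr<icu::BreakIterator> it(
            icu::BreakIterator::createCharacterInstance(
                icu::Locale::getDefault(), status));
        it->setText(text);

        // Count boundaries between user-perceived characters.
        int graphemes = 0;
        it->first();
        while (it->next() != icu::BreakIterator::DONE)
            ++graphemes;

        std::cout << "code points:     " << text.countChar32() << "\n"; // 3
        std::cout << "written letters: " << graphemes          << "\n"; // 2
    }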

I don't know what you mean about the wcs functions being no good. Why not?

Don't you have the feeling an ugly mistake was made when we were forced into using the deficient 16-bit W functions? Shouldn't the problem have been recognized at a much earlier stage, letting all the original API functions take UTF-8 strings and incorporating proper string manipulation routines? Or is that already possible and am I horribly mistaken?

UTF-8 was developed well after the Windows Unicode interface was written. Had they added a UTF-8 version there would now be 3 versions of every function. I'm sure they would not use UTF-16 if they were to start again—hindsight is truly wonderful.

Regarding UTF-32, hardly any software uses that internally. I wouldn't recommend it, especially not on a platform which has no support for it whatsoever. Using UTF-32 would just be creating work for yourself.

There's nothing stopping you from building a simple cache that stores the location and byte length of each UTF-encoded code point, so you can actually use random access (a sketch follows below). All the old C stuff you're talking about isn't going to help much, though.
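A minimal sketch of such a cache, assuming well-formed UTF-8; the class name and interface are purely illustrative, and it deliberately ignores the separate issue of combining marks:

    #include <cstddef>
    #include <string>
    #include <vector>

    class Utf8Index
    {
    public:
        explicit Utf8Index(const std::string& s) : text_(s)
        {
            for (std::size_t i = 0; i < s.size(); )
            {
                starts_.push_back(i);
                unsigned char b = static_cast<unsigned char>(s[i]);
                // The lead byte tells us how long this code point's sequence is.
                i += (b < 0x80) ? 1 : (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
            }
        }

        std::size_t size() const { return starts_.size(); }  // code point count

        // Return the Nth code point as its raw UTF-8 bytes.
        std::string operator[](std::size_t n) const
        {
            std::size_t begin = starts_[n];
            std::size_t end   = (n + 1 < starts_.size()) ? starts_[n + 1] : text_.size();
            return text_.substr(begin, end - begin);
        }

    private:
        std::string text_;
        std::vector<std::size_t> starts_;
    };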

I also wouldn't rely on the UTF-8 'BOM' being present, because it's nonsense and is probably stripped away by some implementations.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow