Question

I'm working on internationalizing one of my programs for work. I'm trying to plan ahead so I can avoid problems or having to redo the process down the road.

I see references for UTF-8, UTF-16 and UTF-32. My question is two parts:

  1. What languages does UTF-8 not support?
  2. What advantages do UTF-16 and UTF-32 have over UTF-8?

If UTF-8 works for everything, then I'm curious what the advantages of UTF-16 and UTF-32 are (e.g. special search features in a database, etc.). Understanding this should help me finish designing my program (and its database connections) properly. Thanks!


Solution

All three are just different ways to represent the same thing, so there are no languages supported by one and not another.

Sometimes UTF-16 is used by a system that you need to interoperate with - for instance, the Windows API uses UTF-16 natively.
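For example, here is a minimal Python 3 sketch of what crossing such a boundary typically involves: encoding explicitly to UTF-16-LE, the little-endian, BOM-less form that the Windows wide-character ("W") APIs expect.

```python
# Minimal sketch: produce the UTF-16-LE bytes a UTF-16-native API expects.
payload = "Grüße".encode("utf-16-le")   # no BOM; little-endian, as Windows uses
print(payload)       # b'G\x00r\x00\xfc\x00\xdf\x00e\x00'
print(len(payload))  # 10 bytes for 5 codepoints
```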

In theory, UTF-32 can represent any codepoint in a single 32-bit integer without ever needing more than one, whereas UTF-8 and UTF-16 may need several 8-bit or 16-bit integers for the same codepoint. But in practice, with combining and non-combining variants of some codepoints, a single 32-bit unit still isn't always one "character".
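A small Python 3 sketch makes the unit counts concrete (the helper name `code_units` is just for illustration):

```python
# How many fixed-size code units each encoding needs per codepoint.
def code_units(s: str, encoding: str, unit_size: int) -> int:
    """Length of the encoded form, measured in code units."""
    return len(s.encode(encoding)) // unit_size

for ch in ["A", "é", "€", "😀"]:
    print(ch,
          code_units(ch, "utf-8", 1),       # 1 / 2 / 3 / 4 units
          code_units(ch, "utf-16-le", 2),   # 1 / 1 / 1 / 2 (surrogate pair)
          code_units(ch, "utf-32-le", 4))   # always 1 per codepoint

# But "one UTF-32 unit per character" still fails for combining sequences:
combined = "e\u0301"   # 'é' as base letter + combining acute accent
print(len(combined), code_units(combined, "utf-32-le", 4))   # 2 2
```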

One advantage of UTF-8 over the others: if you have a bug where you assume the number of 8-, 16- or 32-bit code units is the same as the number of codepoints, it becomes obvious more quickly with UTF-8. Something will fail as soon as any non-ASCII codepoint appears, whereas with UTF-16 the bug can go unnoticed until a codepoint outside the Basic Multilingual Plane (one that needs a surrogate pair) shows up.
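Here is a hedged sketch of that kind of bug in Python 3; `truncate_units` is a made-up function that slices code units while pretending to count characters:

```python
# Buggy truncation: n is interpreted as code units, not codepoints.
def truncate_units(s: str, n: int, encoding: str) -> bytes:
    return s.encode(encoding)[:n]

# With UTF-8 the bug shows up as soon as any non-ASCII text appears:
print(truncate_units("naïve", 3, "utf-8"))   # b'na\xc3' - cut mid-sequence

# With UTF-16 the same text happens to survive, so the bug hides longer:
print(truncate_units("naïve", 6, "utf-16-le").decode("utf-16-le"))   # 'naï'

# ...until a codepoint outside the BMP splits a surrogate pair:
# truncate_units("a😀b", 4, "utf-16-le").decode("utf-16-le")  # UnicodeDecodeError
```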

To answer your first question, here's a list of scripts currently unsupported by Unicode: http://www.unicode.org/standard/unsupported.html

OTHER TIPS

UTF-8 is variable-width, using 1 to 4 bytes per codepoint; UTF-16 uses 2 or 4 bytes; UTF-32 is a fixed 4 bytes.

That is why UTF-8 has an advantage where ASCII characters are the most prevalent, UTF-16 is often more compact where ASCII is not predominant, and UTF-32 covers every possible codepoint in 4 bytes.
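A rough Python 3 comparison of encoded sizes (the sample strings are arbitrary) illustrates the trade-off:

```python
# Encoded size in bytes for ASCII-heavy text versus CJK text.
samples = {
    "mostly ASCII": "internationalization",
    "CJK":          "国際化対応",   # 5 codepoints
}
for label, text in samples.items():
    print(label,
          len(text.encode("utf-8")),      # 20 / 15
          len(text.encode("utf-16-le")),  # 40 / 10
          len(text.encode("utf-32-le")))  # 80 / 20
```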

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow