Question

Apart from storage size, what are the differences between using wchar_t (2 or 4 bytes, depending on the platform) and using UTF-8 when writing text-processing code for non-Western languages?

With wchar_t, the wide versions of the C and C++ string functions can be used just as easily as the narrow ones. Does UTF-8 require additional processing for non-Western text compared with the wide versions of the standard string functions?

Solution

Assuming the library functions work with UTF-8 (generally not the case on Windows), there is no real problem as long as you actually USE the library functions. However, if you write code that manually interprets individual elements of a string array, that code must take into account that a code-point can occupy more than one byte in UTF-8. This applies to every non-ASCII character, including for example German/Scandinavian letters such as 'ä', 'ö', 'ü'. And even with 16 bits per element, there are cases where one code-point takes up two 16-bit elements (a surrogate pair in UTF-16).
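To illustrate what "taking this into account" involves, here is a minimal sketch of a UTF-8 decoder. The function name `decode_utf8` is hypothetical, and the code assumes the input is already valid UTF-8; it does no real validation, so treat it as a demonstration rather than something to use on untrusted data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into code-points.
// Assumption: the input is valid UTF-8. Overlong forms, stray continuation
// bytes and similar errors are NOT detected by this sketch.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)      { cp = lead;        len = 1; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }  // 1110xxxx
        else                  { cp = lead & 0x07; len = 4; }  // 11110xxx
        // Debug-build check of the "valid input" assumption.
        assert(i + len <= s.size() && "truncated UTF-8 sequence");
        // Each continuation byte (10xxxxxx) contributes 6 more bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For example, the five bytes of "a€" (0x61, 0xE2, 0x82, 0xAC) plus "ä" (0xC3, 0xA4) decode to three code-points: U+0061, U+20AC and U+00E4.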

If you don't take this into account, the separate parts can "confuse" processing: an element from the middle of a code-point gets recognised as if it were a complete character with a meaning of its own.
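A small sketch of this kind of confusion, assuming code that converts a string to 32-bit characters by widening each byte on its own (the name `widen_bytewise` is made up for the example). That is correct for ASCII and Latin-1 input, but wrong for UTF-8:

```cpp
#include <cassert>
#include <string>

// Naive conversion: treats every byte as a complete character.
// This is deliberately WRONG for UTF-8 input; it is here to show the failure.
std::u32string widen_bytewise(const std::string& bytes) {
    std::u32string out;
    for (char c : bytes) {
        // Go through unsigned char so bytes >= 0x80 do not become negative.
        out.push_back(static_cast<char32_t>(static_cast<unsigned char>(c)));
    }
    // One output character per input byte, regardless of the encoding.
    assert(out.size() == bytes.size());
    return out;
}
```

Given the UTF-8 bytes for 'ä' (0xC3, 0xA4), this yields two characters, U+00C3 ('Ã') and U+00A4 ('¤'), instead of the single U+00E4. The second byte was only ever the tail of a sequence, but it has been given a meaning of its own.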

The variable length of a code-point has all sorts of interesting effects on, for example, string lengths and substrings: the length is measured in elements of the array holding the string, which can be quite different from the number of code-points.
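The following sketch shows both effects. All function names are hypothetical, and the code assumes well-formed UTF-8 or UTF-16 input. It counts code-points by skipping the elements that only continue a sequence, and it truncates a UTF-8 string without cutting through the middle of one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Number of code-points in a (valid) UTF-8 string: count every byte
// that is not a continuation byte.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (!is_continuation(c)) ++n;
    }
    assert(n <= s.size());  // never more code-points than elements
    return n;
}

// Number of code-points in a (valid) UTF-16 string: low surrogates
// (0xDC00..0xDFFF) are the second half of a pair and are not counted.
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    }
    return n;
}

// Keep at most max_bytes bytes, backing up so that no multi-byte
// sequence is split. Assumes valid UTF-8.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (max_bytes >= s.size()) return s;
    std::size_t cut = max_bytes;
    // s[cut] is the first byte that would be dropped. If it is a
    // continuation byte, the cut lies inside a sequence: back up to its lead.
    while (cut > 0 && is_continuation(s[cut])) --cut;
    return s.substr(0, cut);
}
```

With "Grüße" encoded as UTF-8, `size()` reports 7 elements while `count_code_points_utf8` reports 5. A plain `substr(0, 3)` would end on the first byte of 'ü', whereas `truncate_utf8(s, 3)` returns "Gr". Likewise, a UTF-16 string holding 'a' followed by one surrogate pair has 3 elements but 2 code-points.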

Whichever encoding is used, there are further complications with scripts such as Arabic, where individual characters need to be joined together. This is of course only important when actually drawing characters, but is worth at least bearing in mind.

Terminology (for my writings!):

Character = a letter or symbol that can be displayed on screen.

Code-point = the representation of a character in a string; may occupy one or more elements of a string array.

String array = the storage for a string; consists of elements of a fixed size (e.g. 8 bits, 16 bits, 32 bits, 64 bits).

String element = one unit of a string array (what the Unicode standard calls a code unit).

Licensed under: CC-BY-SA with attribution