What are inconveniences of using UTF-8 instead of wchar_t with non-Western languages? [closed]

Question

Assuming the library functions work with UTF-8 (which is generally not the case on Windows), there is no real problem as long as you actually USE the library functions. However, if you write code that manually interprets individual elements of a string array, that code must take into account that a code-point can occupy more than one byte in UTF-8. This applies to every non-ASCII character, including for example German/Scandinavian letters such as 'ä', 'ö', 'ü'. And even with 16-bit elements, you can find situations where one code-point takes up two 16-bit entries.
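As a rough illustration of what "manually interpreting elements" involves, here is a minimal sketch. The function names `utf8_sequence_length` and `is_utf16_surrogate` are made up for this example and are not library functions. The sketch only inspects bit patterns; it does not reject overlong or out-of-range sequences.

```cpp
#include <cstddef>

// Hypothetical helper: number of 8-bit elements in the UTF-8 sequence that
// starts with `lead`. Returns 0 if `lead` is a continuation byte (10xxxxxx)
// or does not match any lead-byte pattern.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: e.g. 'ä' (0xC3 0xA4)
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation byte or invalid
}

// Hypothetical helper: true if a 16-bit element is one half of a surrogate
// pair, i.e. the code-point it belongs to occupies two 16-bit elements.
bool is_utf16_surrogate(char16_t unit) {
    return unit >= 0xD800 && unit <= 0xDFFF;
}
```

For 'ä', which is stored as the bytes 0xC3 0xA4 in UTF-8, `utf8_sequence_length(0xC3)` gives 2, while calling it on the second byte 0xA4 gives 0 because that byte is not the start of anything.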

If you don't take this into account, the separate parts of a code-point can "confuse" processing: the code may treat an element from the middle of a code-point as if it were a character in its own right, rather than as part of a larger sequence.
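One place where this bites is cutting a string at a fixed number of bytes. The following is only a sketch, assuming the input is already valid UTF-8; `truncate_utf8` and `is_utf8_continuation` are names invented for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for UTF-8 continuation bytes (10xxxxxx),
// i.e. elements that sit in the middle of a multi-byte code-point.
bool is_utf8_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: returns at most `max_bytes` bytes of `s`, moving the
// cut backwards so that it never lands in the middle of a code-point.
// Assumes `s` is valid UTF-8.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t cut = max_bytes;
    // s[cut] is the first byte that would be dropped; if it is a
    // continuation byte, the cut is inside a code-point, so back up.
    while (cut > 0 && is_utf8_continuation(static_cast<unsigned char>(s[cut]))) {
        --cut;
    }
    return s.substr(0, cut);
}
```

With "grüß" stored as `"gr\xC3\xBC\xC3\x9F"`, a plain `substr(0, 3)` would keep the first byte of 'ü' and drop the second, whereas `truncate_utf8(s, 3)` backs up and returns just "gr".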

The variable length of a code-point has all sorts of interesting effects on, for example, string lengths and substrings: the length is counted in elements of the array holding the string, which can be quite different from the number of code-points.
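To make the difference between the two lengths concrete, here is a small sketch that counts code-points in a UTF-8 `std::string`. The name `count_utf8_code_points` is made up for this example, and the code assumes the string is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code-points by counting every byte that is
// NOT a continuation byte (10xxxxxx). Assumes `s` is valid UTF-8.
std::size_t count_utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // this byte starts a new code-point
        }
    }
    return count;
}
```

For "äöü" stored as `"\xC3\xA4\xC3\xB6\xC3\xBC"`, `size()` reports 6 elements, while `count_utf8_code_points` reports 3 code-points. For a pure ASCII string such as "abc" the two numbers agree.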

Whichever encoding is used, there are further complications with, for example, Arabic script, where individual letters are joined together and change shape depending on their neighbours. This of course only matters when actually drawing the text, but is worth at least bearing in mind.

Terminology (for my writings!):

Character = A letter/symbol that can be displayed on screen.

Code-point = The representation of a character in a string; may be one or more elements of a string array.

String array = the storage for a string, consists of elements of a fixed size (e.g. 8 bits, 16 bits, 32 bits, 64 bits)

String Element = One unit of a string array.