Question

Code:

#include <stdio.h>
#include <wchar.h>
#define USE_W
int main()
{
#ifdef USE_W
    const wchar_t *ae_utf16 = L"\x00E6 & ASCII text ae\n";
    wprintf(ae_utf16);
#else
    const char *ae_utf8 = "\xC3\xA6 & ASCII text ae\n";
    printf(ae_utf8);
#endif
    return 0;
}

Output (with USE_W defined, so the wprintf branch runs):

ae & ASCII text ae

while the printf branch produces the correct UTF-8 output:

æ & ASCII text ae

Solution

printf just sends raw bytes to your terminal; it does not know anything about encodings. If your terminal happens to be configured to interpret that as UTF-8, it will show the right characters.

wprintf, on the other hand, does know about encodings. It behaves as though it uses the function wcrtomb, which encodes a wide character (wchar_t) into a multibyte sequence according to the current locale. If the default locale happens to be "C", which is quite minimalistic, the character æ gets converted to the "more or less equivalent" byte sequence ae.
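
You can observe this mechanism directly by calling wcrtomb yourself under different locales. The sketch below (the helper show_conversion is purely illustrative) converts æ and prints the resulting bytes; what the "C" locale does with the character is implementation-defined, and "en_US.UTF-8" is only assumed to be installed:

#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

static void show_conversion(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL) {
        printf("locale \"%s\": not available on this system\n", locale_name);
        return;
    }
    char buf[MB_LEN_MAX];
    mbstate_t state;
    memset(&state, 0, sizeof state);
    /* Encode æ (U+00E6) into the locale's multibyte representation. */
    size_t n = wcrtomb(buf, L'\x00E6', &state);
    if (n == (size_t)-1) {
        printf("locale \"%s\": conversion failed\n", locale_name);
        return;
    }
    printf("locale \"%s\": %zu byte(s):", locale_name, n);
    for (size_t i = 0; i < n; i++)
        printf(" %02X", (unsigned char)buf[i]);
    printf("\n");
}

int main(void)
{
    show_conversion("C");            /* default locale: result is implementation-defined */
    show_conversion("en_US.UTF-8");  /* if available, æ becomes the bytes C3 A6 */
    return 0;
}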

If you set the locale explicitly to one that uses UTF-8, such as "en_US.UTF-8", the output is as expected. Of course, the set of supported locales differs from system to system, so it's no good to hardcode one.
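
A common way to avoid hardcoding a locale is to pass an empty string to setlocale, which selects the locale from the user's environment (LC_ALL, LC_CTYPE, LANG). A minimal sketch, assuming the environment specifies a UTF-8 locale:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* "" picks up the locale from the environment; if that locale uses
       UTF-8, wprintf encodes æ as the bytes C3 A6. */
    if (setlocale(LC_ALL, "") == NULL)
        fputs("warning: could not set locale from environment\n", stderr);

    wprintf(L"\x00E6 & ASCII text ae\n");
    return 0;
}

On a terminal configured for a UTF-8 locale this prints æ & ASCII text ae; in a non-UTF-8 environment the result again depends on what the selected locale can represent.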

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow