vswprintf fails for certain unicode codepoints under Mac OS X

https://stackoverflow.com/questions/15438968

24-03-2022
Question

I am getting inexplicable failures (return value -1) from vswprintf using GCC on Mac OS X. I tested with gcc 4.0 and 4.2.1 under Mac OS X 10.6 and 10.8. GCC under Linux is not affected, and neither is Visual Studio.

To demonstrate the problem I have minimally adapted the example from here so that it prints out vswprintf's return value:

/* vswprintf example */
#include <stdio.h>
#include <stdarg.h>
#include <wchar.h>

void PrintWide ( const wchar_t * format, ... )
{
    wchar_t buffer[256];
    va_list args;
    va_start ( args, format );
    int res = vswprintf ( buffer, 256, format, args );
    va_end ( args );
    wprintf ( L"result=%d\n", res );
    if ( res >= 0 )   /* buffer contents are unspecified after a failure */
        fputws ( buffer, stdout );
}

int main ()
{
    wchar_t str[] = L"test string has %d wide characters.\n";
    PrintWide ( str, (int) wcslen(str) );   /* wcslen returns size_t, %d expects int */
    return 0;
}

From my tests it appears that, depending on the value of str, vswprintf will sometimes fail. Examples:

wchar_t str[] = L"test string has %d wide characters.\n"; // works
wchar_t str[] = L"ßß® test string has %d wide characters.\n"; // works
wchar_t str[] = L"日本語 test string has %d wide characters.\n"; // FAILS
wchar_t str[] = L"Π test string has %d wide characters.\n"; // FAILS
wchar_t str[] = L"\u03A0 test string has %d wide characters.\n"; // FAILS

It appears that any string that includes a character with a Unicode code point above 0xff triggers this problem. Can anyone shed some light on why this is happening? It seems like too big an issue to not have been noticed before!

Solution

If you set the locale before calling vswprintf, it should be fine. To pick up the locale from the environment variables (LANG, LC_ALL, LC_CTYPE) you can do this:

setlocale(LC_CTYPE, "");   // include <locale.h>

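or set it explicitly by passing a locale name instead of "". This is needed because a C program starts in the "C" locale, and all of the wide-character output functions need to know which encoding to use.

Without the call, OS X fails to perform the vswprintf at all, while Linux runs it (though the characters will be incorrect if printed).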
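As a rough sketch of the fix, the snippet below sets LC_CTYPE from the environment and then formats a string containing a code point above 0xff. The names format_wide and demo are hypothetical helpers, not part of the original post. Reporting errno is an addition as well: OS X is expected to set it to EILSEQ when the conversion fails, but treat that as an assumption and check it on your system.

```c
#include <assert.h>
#include <errno.h>
#include <locale.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: formats into buf and returns vswprintf's result.
   On failure it reports errno on stderr. */
int format_wide(wchar_t *buf, size_t n, const wchar_t *format, ...)
{
    va_list args;
    int res;

    assert(buf != NULL && format != NULL);

    errno = 0;
    va_start(args, format);
    res = vswprintf(buf, n, format, args);
    va_end(args);

    if (res < 0) {
        /* buf contents are unspecified here, so they are not printed. */
        fprintf(stderr, "vswprintf failed: %s\n", strerror(errno));
    }
    return res;
}

/* Hypothetical demo: adopt the environment's locale, then format a
   string that contains U+03A0 (a code point above 0xff). */
int demo(void)
{
    wchar_t buffer[256];

    if (setlocale(LC_CTYPE, "") == NULL) {
        /* The environment names a locale that is not installed. */
        fprintf(stderr, "setlocale failed, staying in the \"C\" locale\n");
    }
    return format_wide(buffer, 256, L"\u03A0 has code point %d\n", 0x03A0);
}
```

Calling demo() from main should return a non-negative count once the environment selects a locale that can represent the character (a UTF-8 locale, for instance). If setlocale returns NULL, the program stays in the "C" locale and the original failure can still occur on OS X.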

Here's the relevant section from the Linux man page for wprintf(3), which documents the glibc behavior:

   If the format string contains non-ASCII wide characters, the program
   will only work correctly if the LC_CTYPE category of the current locale
   at run time is the same as the LC_CTYPE category of the current locale
   at compile time. This is because the wchar_t representation is
   platform- and locale-dependent. (The glibc represents wide characters
   using their Unicode (ISO-10646) code point, but other platforms don't
   do this.

Licensed under: CC-BY-SA with attribution
