Question

I've migrated an Xcode iOS project from Xcode 3.2.6 to 4.2. Now I'm getting warnings when I try to initialize a wchar_t with a literal with a non-ASCII character:

wchar_t c1;
if(c1 <= L'я') //That's Cyrillic "ya"

The messages are:

MyFile.cpp:148:28: warning: character unicode escape sequence too long for its type [2]
MyFile.cpp:148:28: warning: extraneous characters in wide character constant ignored [2]

And the literal does not work as expected - the comparison misfires.

I'm compiling with -fshort-wchar, the source file is in UTF-8. The Xcode editor displays the file fine. It compiled and worked on GCC (several flavors, including Xcode 3), worked on MSVC. Is there a way to make LLVM compiler recognize those literals? If not, can I go back to GCC in Xcode 4?

EDIT: Xcode 4.2 on Snow Leopard - long story why.

EDIT2: confirmed on a brand new project. File extension does not matter - same behavior in .m files. -fshort-wchar does not affect it either. Looks like I've gotta go back to GCC until I can upgrade to a version of Xcode where this is fixed.

Was it helpful?

Solution

If the source really is UTF-8, then this isn't correct behavior. However, I can't reproduce the behavior in the most recent version of Xcode.

MyFile.cpp:148:28: warning: character unicode escape sequence too long for its type [2]

This warning should be referring to a 'universal character name' (UCN), which looks like "\U001012AB" or "\u0403". It indicates that the value represented by the escape sequence is larger than the enclosing literal's type can hold. For example, if the codepoint value requires more than 16 bits, then a 16-bit wchar_t will not be able to hold it.

MyFile.cpp:148:28: warning: extraneous characters in wide character constant ignored [2]

This indicates that the compiler thinks there is more than one codepoint represented inside a wide character literal, e.g. L'ab'. The behavior is implementation-defined, and both clang and gcc simply use the last codepoint's value.

The code you show shouldn't trigger either warning, at least in clang: the first applies only to UCNs (and 'я' fits easily within a single 16-bit wchar_t anyway), and the second doesn't apply because the source encoding is always taken to be UTF-8, so the multibyte UTF-8 representation of 'я' is seen as a single codepoint.

You might recheck that the source actually is UTF-8, and then make sure you're using an up-to-date version of Xcode. You can also try switching the compiler in your project's build settings ("Compiler for C/C++/Objective-C").

OTHER TIPS

Not an answer, but hopefully helpful information — I could not reproduce the problem with clang 4.0 (Xcode 4.5.1):

$ uname -a
Darwin air 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64
$ env | grep LANG
LANG=en_US.UTF-8
$ clang -v
Apple clang version 4.0 (tags/Apple/clang-421.0.60) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.2.0
Thread model: posix
$ cat test.c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    wchar_t c1 = 0;
    printf("sizeof(c1) == %lu\n", sizeof(c1));
    printf("sizeof(L'Я') == %lu\n", sizeof(L'Я'));
    if (c1 < L'Я') {
        printf("Я люблю часы Заря!\n");
    } else {
        printf("Что за....?\n");
    }
    return EXIT_SUCCESS;
}

$ clang -Wall -pedantic ./test.c 
$ ./a.out 
sizeof(c1) == 4
sizeof(L'Я') == 4
Я люблю часы Заря!
$ clang -Wall -pedantic ./test.c -fshort-wchar
$ ./a.out 
sizeof(c1) == 2
sizeof(L'Я') == 2
Я люблю часы Заря!
$ 

The same behavior is observed with clang++ (where wchar_t is a built-in type).

I don't have an answer to your specific question, but I wanted to point out that llvm-gcc has been permanently discontinued. In my experience with deltas between clang, llvm-gcc, and gcc, clang is often correct with regard to the C++ specification, even when its behavior is surprising.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow