Xcode 4.6 (4H127), clang warns 'illegal character encoding in string literal' for ISO-8859-1 encoded o-umlaut (0xF6)

StackOverflow https://stackoverflow.com/questions/14717356

Question

This code compiled with the previous release of Xcode. I updated Xcode, and now compilation fails. I'm guessing there is something wrong with my code. The question mark in the code below stands for o-umlaut (ö) encoded according to ISO-8859-1 (0xF6), which we used to call upper (or extended) ASCII. I'm guessing the problem has something to do with clang moving to UTF-8 input encoding?

$ xcrun -sdk macosx10.8 -run clang -v
Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn)
Target: x86_64-apple-darwin12.2.0

$ cat test.c
#include <stdio.h>
int main( int argc, char** argv )
{
    fprintf( stderr, "?\n" );
    return 0;
}

$ xcrun -sdk macosx10.8 -run clang -o test test.c 
test.c:4:23: warning: illegal character encoding in string literal [-Winvalid-source-encoding]
    fprintf( stderr, "<F6>\n" );
                      ^~~~
1 warning generated.

Solution

So, it seems that clang from the latest Xcode (4.6) expects UTF-8 source and complains about upper (or extended) ASCII: a byte such as 0xF6, which ISO-8859-1 uses to store a universal character set (UCS) code point directly, is not valid UTF-8 when it is dropped into your source on its own. I haven't checked the release notes to verify that the new clang requires UTF-8, but I changed my source to contain a properly UTF-8-encoded lowercase o-umlaut, and it compiled.

0xF6 (246) is the UCS code point for lowercase o-umlaut. However, to encode it properly in UTF-8 you cannot just place 0xF6 in a single byte in your file. The proper UTF-8 encoding is two bytes: 0xC3 0xB6. See details below. So crack open your favorite hex editor and replace the single 0xF6 byte with two bytes: 0xC3 0xB6.
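If editing the file in a hex editor is inconvenient, another option is to spell the two UTF-8 bytes with hex escapes so the source file itself stays pure ASCII. This is a minimal sketch rather than part of the fix described above; the names `O_UMLAUT_UTF8` and `print_o_umlaut` are made up for illustration.

```c
#include <stdio.h>

/* Lowercase o-umlaut written as its two UTF-8 bytes (0xC3 0xB6).
   The source file contains only ASCII, so the compiler has nothing
   to complain about regardless of the input encoding it expects.
   O_UMLAUT_UTF8 is a hypothetical name used for this example. */
static const char O_UMLAUT_UTF8[] = "\xC3\xB6";

/* Hypothetical helper: prints the character followed by a newline. */
void print_o_umlaut(FILE *out)
{
    fprintf(out, "%s\n", O_UMLAUT_UTF8);
}
```

Calling `print_o_umlaut(stderr)` writes the bytes 0xC3 0xB6 0x0A. One caveat: a hex escape consumes every hex digit that follows it, so when the next character is a hex digit, split the literal, for example `"\xC3\xB6" "ffnen"`.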

Here is a great hex editor: Hex Fiend

So, what if your problem character isn't o-umlaut? I've included a list of a few common characters, but you can follow the steps below to find any other UTF-8 encoding to solve your particular problem:

| Char | ISO-8859-1 |   UTF-8   |
| ---- | ---------- | --------- |
|  ©   |    0xA9    | 0xC2 0xA9 |
|  ®   |    0xAE    | 0xC2 0xAE |
|  Ä   |    0xC4    | 0xC3 0x84 |
|  Å   |    0xC5    | 0xC3 0x85 |
|  Æ   |    0xC6    | 0xC3 0x86 |
|  Ç   |    0xC7    | 0xC3 0x87 |
|  É   |    0xC9    | 0xC3 0x89 |
|  Ñ   |    0xD1    | 0xC3 0x91 |
|  Ö   |    0xD6    | 0xC3 0x96 |
|  Ü   |    0xDC    | 0xC3 0x9C |
|  ß   |    0xDF    | 0xC3 0x9F |
|  à   |    0xE0    | 0xC3 0xA0 |
|  á   |    0xE1    | 0xC3 0xA1 |
|  â   |    0xE2    | 0xC3 0xA2 |
|  ä   |    0xE4    | 0xC3 0xA4 |
|  å   |    0xE5    | 0xC3 0xA5 |
|  æ   |    0xE6    | 0xC3 0xA6 |
|  ç   |    0xE7    | 0xC3 0xA7 |
|  è   |    0xE8    | 0xC3 0xA8 |
|  é   |    0xE9    | 0xC3 0xA9 |
|  ê   |    0xEA    | 0xC3 0xAA |
|  ë   |    0xEB    | 0xC3 0xAB |
|  ì   |    0xEC    | 0xC3 0xAC |
|  í   |    0xED    | 0xC3 0xAD |
|  î   |    0xEE    | 0xC3 0xAE |
|  ï   |    0xEF    | 0xC3 0xAF |
|  ñ   |    0xF1    | 0xC3 0xB1 |
|  ò   |    0xF2    | 0xC3 0xB2 |
|  ó   |    0xF3    | 0xC3 0xB3 |
|  ô   |    0xF4    | 0xC3 0xB4 |
|  ö   |    0xF6    | 0xC3 0xB6 |
|  ù   |    0xF9    | 0xC3 0xB9 |
|  ú   |    0xFA    | 0xC3 0xBA |
|  û   |    0xFB    | 0xC3 0xBB |
|  ü   |    0xFC    | 0xC3 0xBC |
|  ÿ   |    0xFF    | 0xC3 0xBF |

Only lower ASCII (7-bit characters) can be encoded as a single byte in UTF-8. See http://en.wikipedia.org/wiki/UTF-8.

Code points that are 8-11 bits in length are encoded in UTF-8 as:

110xxxxx  10xxxxxx

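This being the case, a lone 0xF6 is improperly encoded. Its bit pattern (11110110) does not match the 110xxxxx lead byte shown above; it looks like the lead byte of a longer sequence, but lead bytes above 0xF4 would encode values beyond U+10FFFF, so 0xF6 can never appear in valid UTF-8. In the test file it is also followed by a newline rather than by a byte whose highest two bits are 1 and 0.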
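To locate an offending byte without reading a hex dump, a small scanner can apply the lead-byte and continuation-byte patterns to a buffer. The following is a simplified sketch under stated assumptions: the function name `first_bad_utf8_byte` is hypothetical, and the check deliberately ignores overlong three- and four-byte forms and surrogate code points, so it is not a full validator.

```c
#include <stddef.h>

/* Hypothetical helper: returns the offset of the first byte that cannot
   start a well-formed UTF-8 sequence (or whose continuation bytes are
   missing), or -1 if the buffer passes this simplified check. */
long first_bad_utf8_byte(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;                     /* continuation bytes expected */

        if (b < 0x80)                    extra = 0;  /* 0xxxxxxx */
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;  /* 110xxxxx */
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;  /* 1110xxxx */
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;  /* 11110xxx */
        else return (long)i;  /* stray continuation or invalid lead byte */

        /* Every continuation byte must look like 10xxxxxx. */
        for (size_t k = 1; k <= extra; k++) {
            if (i + k >= len || (buf[i + k] & 0xC0) != 0x80)
                return (long)i;
        }
        i += extra + 1;
    }
    return -1;
}
```

Run over the bytes of the test file, this would report the offset of the 0xF6 byte; after the byte is replaced with 0xC3 0xB6, it would return -1.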

The proper encoding of this UCS code point (246 or 0xF6) in UTF-8 is 0xC3 0xB6 which looks like this:

11000011  10110110

Encoding 0xF6 means taking its lower 6 bits and plugging them into the second byte, while its upper 2 bits go into the first byte. Example:

0xF6
11110110
   11    <-SPLIT->  110110
     \                 \
110xxxxx           10xxxxxx

Since 0xF6 is only 8 bits, the first 3 x's in the first byte can be set to 0. So you get:

11000011  10110110

Or:

0xC3 0xB6
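The bit manipulation above generalizes to every ISO-8859-1 byte, since each one is also its own UCS code point. Here is a minimal sketch of that step in C; the function name `latin1_to_utf8` is hypothetical and chosen only for this example.

```c
#include <stddef.h>

/* Hypothetical helper: writes the UTF-8 encoding of one ISO-8859-1 byte
   into out (room for 2 bytes) and returns the number of bytes written. */
size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {
        /* 7-bit ASCII is encoded as itself. */
        out[0] = c;
        return 1;
    }
    /* First byte: 110 prefix, three zero bits, then the upper 2 bits. */
    out[0] = (unsigned char)(0xC0 | (c >> 6));
    /* Second byte: 10 prefix, then the lower 6 bits. */
    out[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
}
```

For 0xF6 this yields 0xC3 0xB6, and for 0xA9 it yields 0xC2 0xA9, matching the table. Applying it to every byte of a file that is known to be ISO-8859-1 converts the whole file, but it should not be run on a file that is already UTF-8.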

Hopefully this helps you properly encode whatever file clang is choking on. I seem to run into this problem with open source code. Many times the offending character is in a comment (an author's name), in which case you can just change it to whatever you want. Sometimes you don't have the luxury of rewording the text, in which case you should fix the encoding and send a patch to the maintainer.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow