Question

I understand that char in C++ is just an integer type that stores ASCII symbols as numbers ranging from 0 to 127. The Scandinavian letters 'æ', 'ø', and 'å' are not among the 128 symbols in the ASCII table.

So naturally, when I try char ch1 = 'ø' I get a compiler error. However, string str = "øæå" works fine, even though a string is made up of chars, right?

Does string somehow switch over to Unicode?


Solution 2

Compiling the source code char c = 'ø'; gives:

source_file.cpp:2:12: error: character too large for enclosing character literal type
  char c = '<U+00F8>';
           ^

What's happening here is that the compiler converts the character from the source encoding and determines that the execution encoding has no representation of that character that fits inside a single char. (Note that this error has nothing to do with the initialization of c; it would happen with any such character literal.)

When you put such characters into a string literal rather than a character literal, however, the compiler's conversion from the source encoding to the execution encoding is perfectly happy to use multi-byte representations of the characters when the execution encoding is multi-byte, as UTF-8 is.
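A minimal sketch of that difference, assuming a C++11 or later compiler. The u8 prefix and the universal character name \u00F8 are used here so the result does not depend on how the source file is saved or on the compiler's default execution encoding; the function names are hypothetical and only serve the illustration.

```cpp
#include <cstddef>

// Number of bytes the UTF-8 string literal for U+00F8 ('ø') occupies,
// not counting the terminating null. u8"..." requests UTF-8 explicitly.
constexpr std::size_t utf8_bytes_for_o_slash() {
    return sizeof(u8"\u00F8") - 1;
}

// The same measurement for a plain ASCII letter, for comparison.
constexpr std::size_t utf8_bytes_for_a() {
    return sizeof(u8"a") - 1;
}

// 'ø' needs two UTF-8 code units, so it cannot fit in one char,
// while an ASCII letter needs exactly one.
static_assert(utf8_bytes_for_o_slash() == 2, "U+00F8 takes two UTF-8 bytes");
static_assert(utf8_bytes_for_a() == 1, "ASCII letters take one byte");
```

The string literal simply grows to hold both bytes, which is why it compiles where the single-char literal does not.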

To better understand what compilers do in this area you should start by reading clauses 2.3 [lex.charsets], 2.14.3 [lex.ccon], and 2.14.5 [lex.string] in the C++ standard.

OTHER TIPS

In C++ there is the source character set and the execution character set. The source character set is what you can use in your source code; but this doesn't have to coincide with which characters are available at runtime.

It's implementation-defined what happens if you use characters in your source code that aren't in the source character set. Apparently 'ø' is not in your compiler's source character set; otherwise you wouldn't have gotten an error. This means that your compiler's documentation should include an explanation of what it does for both of these code samples. You will probably find that str does contain some sequence of bytes that forms a string.

To avoid this you could use escape sequences instead of embedding the characters directly in your source code, in this case '\xF8' (the value of 'ø' in ISO-8859-1). If you need to use characters that aren't in the execution character set either, you can use wchar_t and wstring.
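Both suggestions can be sketched as follows. This is only an illustration: the function names are made up, and the byte 0xF8 is displayed as 'ø' only if whatever shows the output interprets it as ISO-8859-1 (Latin-1) or a compatible single-byte encoding, which is an assumption here.

```cpp
#include <string>

// '\xF8' is a hexadecimal escape: it stores the byte value 0xF8 directly,
// without the compiler having to convert any source character.
inline unsigned char latin1_o_slash() {
    const char c = '\xF8';
    // Cast so the value reads as 248 even where plain char is signed.
    return static_cast<unsigned char>(c);
}

// A wide string holds U+00F8 as a single wchar_t element.
inline std::wstring wide_o_slash() {
    return L"\u00F8";
}
```

Calling wide_o_slash().size() gives 1, whereas a UTF-8 encoded std::string for the same letter would have a size of 2.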

What's likely happening here is that your source file is encoded as UTF-8 or some other multi-byte character encoding, and the compiler is simply treating it as a sequence of bytes. A single char can only be a single byte, but a string is perfectly happy to be as many bytes as are required.
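One way to check this is to dump the bytes a string actually holds. The helper below is a hypothetical sketch, not part of any library; it formats each byte as two hex digits.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats every byte of s as two lowercase hex digits separated by spaces.
inline std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[3];  // two hex digits plus the terminating null
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) out += ' ';
        // Go through unsigned char so bytes above 0x7F are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned>(static_cast<unsigned char>(s[i])));
        out += buf;
    }
    return out;
}
```

If the source file and the execution encoding are both UTF-8, hex_bytes("ø") would be expected to return "c3 b8", the two-byte UTF-8 encoding of 'ø'. With other encodings the bytes will differ.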

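ASCII defines only 128 characters. 'ø' is number 248 (out of 0 to 255) in extended ASCII, that is, in 8-bit character sets such as ISO-8859-1 that contain the 7-bit ASCII characters as their first 128 values, so it is not an ASCII character. You can try char ch1 = '\xF8'; (248 is F8 in hexadecimal; D8 is 216, the uppercase 'Ø').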

Licensed under: CC-BY-SA with attribution