Question

I just realized (thanks to my university course) that many of the things I thought I knew about Unicode were wrong. So I started reading and fixing my knowledge, and the following doubts immediately arose while playing around with a simple "Hello world" C++ program in MSVC2012:

#include <iostream>
#include <string.h>
using namespace std;

int main(void) {

    char arr1[] = "I am a nice boy"; // Is this stored as UTF-8 (multi-byte) or ASCII?
    char arr[] = "I'm a nice èboi"; // All characters should be ASCII except the 'è' one, which encoding is used for this?
    cout << strlen(arr); // Returns 15 as ASCII, why?

    // If I choose "multi-byte character set" in my VS project configuration instead of "unicode", what does this mean and what
    // will this affect?

    char arr2[] = "I'm a niße boy"; // And what encoding is it used here?
    cout << strlen(arr2); // Returns 1514, what does this mean?

    // If UTF-32 usually use 4 bytes to encode a character (even if they're not needed), how can a unicode code point like U+FFFF
    // (FFFF hexadecimal is 65535 in decimal) represent any possible unicode character if the maximum is FFFF ? (http://inamidst.com/stuff/unidata/)

    return 0;
}

The above was compiled with "multi-byte character set", but since multi-byte is (I guess?) a type of Unicode encoding, even this is not clear to me.

Can someone please help me out with clear explanations for the above questions?


Solution

    char arr1[] = "I am a nice boy"; // Is this stored as UTF-8 (multi-byte) or ASCII?

This is stored in the compiler's execution charset. The compiler gets to choose what that is and should document it. GCC lets you set the execution encoding with the flag -fexec-charset=charset and defaults to UTF-8; MSVC uses the machine's 'language for non-Unicode programs' setting (the ANSI code page) from the system language settings, which cannot be UTF-8; and Clang uses UTF-8 unconditionally.

char arr[] = "I'm a nice èboi"; // All characters should be ASCII except the 'è' one, which encoding is used for this?
cout << strlen(arr); // Returns 15 as ASCII, why?

The compiler execution charset doesn't actually have to be ASCII-compatible at all. For example, it could be EBCDIC.

strlen(arr) returns 15 because the string literal, encoded using the compiler execution charset, is 15 bytes long. Since the string literal is 15 characters long, this means the execution charset used a single byte for each of those characters, including 'è'. (And since UTF-8 cannot encode that string in only 15 bytes, this conclusively indicates that your compiler is not using UTF-8 as its execution charset.)
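
If you want to check this on your own machine, a minimal sketch like the following (nothing compiler-specific assumed) dumps the bytes of the literal as hex. Under a Windows-1252-style execution charset you would see 15 bytes with 'è' as the single byte E8, whereas under UTF-8 you would see 16 bytes with 'è' as the pair C3 A8.

#include <iostream>
#include <cstring>

int main() {
    const char arr[] = "I'm a nice èboi";

    // strlen counts bytes up to the terminating '\0', not "characters".
    std::cout << "strlen: " << std::strlen(arr) << "\n";

    // Dump each byte as hex so you can see how 'è' was encoded by the
    // compiler's execution charset (E8 for Windows-1252, C3 A8 for UTF-8).
    for (std::size_t i = 0; i < std::strlen(arr); ++i) {
        std::cout << std::hex << std::uppercase
                  << static_cast<int>(static_cast<unsigned char>(arr[i])) << ' ';
    }
    std::cout << "\n";
    return 0;
}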

char arr2[] = "I'm a niße boy"; // And what encoding is it used here?
cout << strlen(arr2); // Returns 1514, what does this mean?

The encoding does not change based on the content of the string; the compiler always uses the execution charset. I'm assuming '1514' is simply the two outputs run together: your program prints strlen(arr) and strlen(arr2) with no separator between them, so the 15 from the first call and the 14 from the second appear as "1514". strlen(arr2) in fact returns 14, because there are 14 characters in that string and, as with the earlier string, each one is encoded as a single byte.

If I choose "multi-byte character set" in my VS project configuration instead of "unicode", what does this mean and what will this affect?

That setting has nothing to do with the encodings used by the compiler. It just sets macros in Microsoft's headers to different things: TCHAR, all the generic names that choose between the *W and *A functions, and so on.

In fact it's entirely possible to write a program using multi-byte character strings with 'Unicode' enabled, and just as possible to use Unicode with 'multi-byte character set' enabled.
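
To make that concrete, here is a minimal Windows-only sketch (using the real <tchar.h> and <windows.h> headers; the example itself is not from the original answer) showing that the project setting only changes what the generic-text names expand to:

#include <windows.h>
#include <tchar.h>
// Link with user32.lib for MessageBox*.

int main() {
    // With "Unicode" selected, the project defines UNICODE/_UNICODE:
    //   TCHAR is wchar_t, _T("...") is a wide literal, MessageBox is MessageBoxW.
    // With "Multi-byte character set", it defines _MBCS:
    //   TCHAR is char, _T("...") is a narrow literal, MessageBox is MessageBoxA.
    TCHAR msg[] = _T("Hello");
    MessageBox(NULL, msg, _T("Demo"), MB_OK);

    // Regardless of the setting, you can always call either variant explicitly:
    MessageBoxW(NULL, L"wide string", L"Explicit W", MB_OK);
    MessageBoxA(NULL, "narrow string", "Explicit A", MB_OK);
    return 0;
}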

If UTF-32 usually use 4 bytes to encode a character (even if they're not needed), how can a unicode code point like U+FFFF (FFFF hexadecimal is 65535 in decimal) represent any possible unicode character if the maximum is FFFF ? (http://inamidst.com/stuff/unidata/)

This question makes no sense. Perhaps if you rephrase...

OTHER TIPS

char holds a single byte in C++ (8 bits on every mainstream platform), regardless of everything else. So those variables contain sequences of bytes. If they hold Unicode at all, which they might not, then they hold UTF-8.
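
A trivial sketch (not from the original answer) that verifies the "char is a byte" claim at compile time:

#include <climits>

// sizeof(char) is 1 by definition in C++; CHAR_BIT is 8 on mainstream platforms.
static_assert(sizeof(char) == 1, "sizeof(char) is always 1");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

int main() { return 0; }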

Accented characters in the Latin-1 set (such as è) have two representations in Unicode: composed and decomposed. The composed version is a single code point; the decomposed version is two code points (the base letter followed by a combining accent). You can look at resources such as http://www.fileformat.info/info/unicode/char/e8/index.htm; it tells you that the character you posted in your question is the composed form, and that in UTF-8 it is 0xC3 0xA8 (two bytes).
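
As an illustration (assuming a pre-C++20 compiler where u8 string literals are plain char arrays; the byte values shown are standard UTF-8, independent of any particular compiler), the composed and decomposed forms of è have different byte lengths:

#include <cstring>
#include <iostream>

int main() {
    // U+00E8 (composed "è") encodes in UTF-8 as the two bytes 0xC3 0xA8.
    const char composed[] = u8"\u00E8";
    // "e" + U+0300 (combining grave accent) is the decomposed form:
    // one byte for 'e' plus two bytes (0xCC 0x80) for the accent.
    const char decomposed[] = u8"e\u0300";

    std::cout << std::strlen(composed) << "\n";   // prints 2
    std::cout << std::strlen(decomposed) << "\n"; // prints 3
    return 0;
}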

It is also possible that you are compiling with the ANSI code page (ACP) set to a Latin-1-style code page such as Windows-1252, not in Unicode at all, in which case each of these characters will be a single byte long.
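
If you want to check which ANSI code page your machine uses (and therefore which charset MSVC is likely to pick as its execution charset, per the answer above), a small Windows-only sketch with the GetACP() API prints it; 1252 is the Western European, Latin-1-style code page:

#include <windows.h>
#include <iostream>

int main() {
    // GetACP() returns the system's active ANSI code page,
    // e.g. 1252 for Western European (Windows-1252).
    std::cout << "Active ANSI code page: " << GetACP() << "\n";
    return 0;
}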

Your strlen of 1514 is incomprehensible to me; I want to wonder if char[] = "xxxx" does not initialize with a trailing zero, but I don't recall one way or the other. You could try changing those to char* instead and see whether you get a different answer.

If UTF-32 usually use 4 bytes to encode a character (even if they're not needed), how can a unicode code point like U+FFFF (FFFF hexadecimal is 65535 in decimal) represent any possible unicode character if the maximum is FFFF ? (http://inamidst.com/stuff/unidata/)

Your source is out of date. Unicode was limited to a max codepoint of U+FFFF back in the early days when UCS-2 was the only Unicode encoding, but Unicode outgrew that limit years ago. UTFs (UTF-8, UTF-16, UTF-32) were created to replace UCS-2 and extend the limit, which is currently codepoint U+10FFFF (the highest codepoint that UTF-16 can encode).
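
As a sketch of how a code point above U+FFFF fits into UTF-16's 16-bit units (the surrogate-pair arithmetic is the standard one from the Unicode specification; the sample code point U+1F600 is an arbitrary choice, not from the original answer):

#include <cstdint>
#include <iostream>

int main() {
    // Any code point above U+FFFF is split into two 16-bit surrogates.
    std::uint32_t cp = 0x1F600;               // an emoji, well above U+FFFF
    std::uint32_t v  = cp - 0x10000;          // 20 bits remain after subtracting
    std::uint16_t high = 0xD800 + (v >> 10);  // top 10 bits    -> 0xD83D
    std::uint16_t low  = 0xDC00 + (v & 0x3FF);// bottom 10 bits -> 0xDE00

    std::cout << std::hex << high << " " << low << "\n"; // prints d83d de00

    // UTF-32 simply stores the code point in one 32-bit unit, which is why
    // 4 bytes are enough for everything up to the Unicode maximum U+10FFFF.
    return 0;
}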

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow