Question

When I use this code to get the integer value of 'س' in Unicode, I get 1587 (0x633 in hex). That is the correct value for 'س' in the Unicode standard.

wchar_t wc = L'س';
cout << int(wc); // prints 1587; wcout << int(wc) works too

Next, I put this character into a text file saved with UTF-8 encoding and inspected the file in a hex viewer. I see d8 b3, which is 55475 in decimal.

Why don't these values match?

Added: Here is my code:

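#include <cstdio>
#include <fcntl.h>  // _O_U8TEXT
#include <io.h>     // _setmode (MSVC-specific)

wchar_t wc = L'س';
FILE *f = fopen("input1.txt", "w");
_setmode(_fileno(f), _O_U8TEXT);  // wide-character output is converted to UTF-8
fwprintf(f, L"%c", wc);
fclose(f);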

Solution

Quoting the question: "d8 b3 that means 55475 in decimal."

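Those bytes, 0xD8 0xB3, are the correct UTF-8 encoding of the Unicode character 'ARABIC LETTER SEEN' (U+0633). See here for a reference. Reading the two bytes as the single number 55475 is not meaningful, because UTF-8 bytes carry marker bits in addition to the bits of the code point. When I run your code and open the file in a text editor that understands UTF-8 without a BOM, I can see the character. 1587 (0x633) is the code point itself; it is also the value of the single code unit when the character is encoded in UTF-16 or UTF-32.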
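
To check this by hand, the bytes can be decoded back into a code point. The following is a minimal sketch, not a full validator: `decode_utf8_first` and `kInvalid` are hypothetical names introduced here for illustration, and the function does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical sentinel returned for malformed input (an assumption of this sketch).
constexpr std::uint32_t kInvalid = 0xFFFFFFFF;

// Decode the first UTF-8 sequence in `s` and return its code point.
// Sketch only: overlong forms and surrogate values are not rejected.
std::uint32_t decode_utf8_first(const std::string& s) {
    if (s.empty()) return kInvalid;
    const unsigned char lead = static_cast<unsigned char>(s[0]);
    std::uint32_t cp = 0;
    std::size_t extra = 0;  // number of continuation bytes expected
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else return kInvalid;  // a continuation byte or an invalid lead byte
    if (s.size() < extra + 1) return kInvalid;
    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return kInvalid;  // must look like 10xxxxxx
        cp = (cp << 6) | (c & 0x3F);              // append the 6 payload bits
    }
    return cp;
}
```

Passing a string that holds the two bytes 0xD8 and 0xB3 should give 1587: the lead byte contributes 0x18, the continuation byte contributes 0x33, and (0x18 << 6) | 0x33 = 0x633.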

Other tips

UTF-8 doesn't use every bit of each byte for the code point's value, because it needs some bits to signal whether the character spans multiple bytes. You can see the details here: https://en.wikipedia.org/wiki/UTF-8

The following table, from http://www.cl.cam.ac.uk/~mgk25/unicode.html, lists the code point ranges and the bit patterns used to encode them:

U-00000000 – U-0000007F:    0xxxxxxx
U-00000080 – U-000007FF:    110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF:    1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF:    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF:    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF:    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
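
The bit patterns above can be turned into a small encoder. This is a sketch under stated assumptions rather than a reference implementation: `encode_utf8` is a hypothetical helper name, it returns an empty string for invalid input, and it covers only the first four rows of the table, because current UTF-8 (RFC 3629) stops at U+10FFFF and excludes the surrogate range.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Hypothetical helper: returns an empty string for surrogates and for
// values above U+10FFFF (an error convention chosen for this sketch).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `encode_utf8(0x633)` should give the two bytes 0xD8 0xB3, which is what the hex view of the text file shows.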

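Edit, to make it clearer: D8 B3 is the hexadecimal form of the UTF-8 byte sequence that encodes code point 1587 (U+0633). 1587 is 11000110011 in binary, which falls in the second row of the table, so its bits fill the pattern as 110 11000 and 10 110011, giving 0xD8 0xB3.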

License: CC BY-SA with attribution
Not affiliated with StackOverflow