PHP and C++ for UTF-8 code unit in reverse order in Chinese character

https://stackoverflow.com/questions/15972306

03-04-2022

Question

The Unicode code points for the two characters of the Chinese word 你好 are 4F60 and 597D respectively, which I got from this tool: http://rishida.net/tools/conversion/

The console application below prints the hexadecimal byte sequence of 你好 as 60:4F:7D:59. As you can see, the bytes of each character are in the reverse order of its Unicode code point: 60 first and then 4F, instead of 4F and then 60. Why is that? Which one is correct, the tool or the console app? Or both?

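#include <cstddef>
#include <cstdio>

// Write `len` bytes of `buf` to `filename` as colon-separated hex.
void printHex(const unsigned char *buf, size_t len, const char *filename)
{
    FILE *fp = fopen(filename, "w");
    if (fp == NULL) return;

    for (size_t i = 0; i < len; i++)
    {
        if (i > 0) fprintf(fp, ":");
        // unsigned char avoids sign extension (e.g. FFFFFF80) for bytes >= 0x80
        fprintf(fp, "%02X", buf[i]);
    }
    fprintf(fp, "\n");
    fclose(fp);
}

int main(int argc, char* argv[])
{
    const wchar_t str3[] = L"你好";
    // Pass the real byte length: sizeof on a char* only yields the pointer size.
    // The terminating L'\0' is excluded.
    printHex(reinterpret_cast<const unsigned char*>(str3),
             sizeof(str3) - sizeof(wchar_t),
             "C:\\Users\\william\\Desktop\\My Document\\test2.txt");

    return 0;
}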

In PHP, on the other hand, I use the mb_convert_encoding function:

echo bin2hex(mb_convert_encoding("你好", "UTF-16", "UTF-8")); //result : 4f60 597d
echo bin2hex(mb_convert_encoding("恏絙", "UTF-16", "UTF-8")); //result : 604f 7d59

PHP gives the same result as the online tool, but when I use this encoding to print 你好 on a printer with the php_printer.dll functions, the printout becomes 恏絙, and vice versa. The C++ application, however, prints correctly. What could be wrong with the PHP version, and what is the solution?

Solution

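They're both correct. The difference is in endianness.

Judging by your output, plain UTF-16 in mb_convert_encoding writes the bytes big-endian (4f60 597d), whereas your C++ program dumps the wchar_t array as it sits in memory, which on Windows is little-endian (60:4F:7D:59). You can enforce a specific byte order by using UTF-16BE or UTF-16LE instead.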

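Note that what you are comparing are not Unicode code points as such, but the UTF-16BE/LE (or UCS-2) byte representation. For characters in the Basic Multilingual Plane, such as these two, the 16-bit code unit has the same numeric value as the code point, so only the byte order differs. Characters outside that plane are encoded as a surrogate pair, and there the numbers no longer match the code point at all.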
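To make the distinction between code points and UTF-16 code units concrete, here is a minimal sketch of the UTF-16 encoding step. The function name encodeUtf16 is made up for illustration and is not part of any library; it assumes the input is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
std::u16string encodeUtf16(char32_t cp)
{
    // Assumption: cp is a valid scalar value (not a surrogate, not above U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: the code unit equals the code point.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

For U+4F60 (你) this yields the single code unit 0x4F60, while U+1F600 yields the pair 0xD83D 0xDE00. Byte order only comes into play afterwards, when each code unit is written out as two bytes.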

EDIT: Using UTF-16LE in mb_convert_encoding gives you the reversed byte order, which is the one your C++ application produces.
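The two byte orders can be reproduced side by side in C++ without relying on how the platform lays out wchar_t in memory. This is only a sketch: serializeUtf16, toHex and ByteOrder are names invented for this example, and the code does not write a byte order mark.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

enum class ByteOrder { Big, Little };

// Hypothetical helper: write each 16-bit code unit as two bytes in the given order.
std::vector<std::uint8_t> serializeUtf16(const std::u16string& units, ByteOrder order)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t unit : units) {
        const std::uint8_t high = static_cast<std::uint8_t>(unit >> 8);
        const std::uint8_t low  = static_cast<std::uint8_t>(unit & 0xFF);
        if (order == ByteOrder::Big) {
            bytes.push_back(high);  // UTF-16BE: most significant byte first
            bytes.push_back(low);
        } else {
            bytes.push_back(low);   // UTF-16LE: least significant byte first
            bytes.push_back(high);
        }
    }
    assert(bytes.size() == units.size() * 2);
    return bytes;
}

// Hypothetical helper: format bytes as colon-separated uppercase hex.
std::string toHex(const std::vector<std::uint8_t>& bytes)
{
    std::string out;
    char buf[3];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (i > 0) out += ':';
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(bytes[i]));
        out += buf;
    }
    return out;
}
```

Calling toHex(serializeUtf16(u"\u4F60\u597D", ByteOrder::Big)) gives 4F:60:59:7D, the order PHP printed for plain UTF-16, and ByteOrder::Little gives 60:4F:7D:59, the order the console application printed.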

Licensed under: CC-BY-SA with attribution
