Question

I'm getting 16 bits from a struct in memory, and I need to convert them into a string. The 16 bits represent a unicode char:

typedef struct my_struct {
    unsigned    unicode     : 16;
} my_struct;

I started by casting the bits into an unsigned char, which worked for values small enough to fit in one char. However, for characters like '♪', it truncates incorrectly. This is what I have so far:

        char buffer[2] = { 0 };
        wchar_t wc[1] = { 0 };

        wc[0] = page->text[index].unicode;
        std::cout << wc[0] << std::endl; //PRINT LINE 1
        int ret = wcstombs(buffer, wc, sizeof(buffer));
        if(ret < 0)
            printf("SOMETHING WENT WRONG \n");
        std::string my_string(buffer);
        printf("%s \n", my_string.c_str()); //PRINT LINE 2

Print line 1 currently prints: "9834" and print line 2 prints: "" (empty string). I'm trying to get my_string to contain '♪'.

Solution

The 9834 that your first print line shows is decimal, i.e. 0x266A (U+266A, '♪'). If I've done my conversion correctly, 0x266A in UTF-16 (16 bit Unicode) translates to the three byte sequence 0xE2, 0x99, 0xAA in UTF-8 (8 bit Unicode). I don't know about other narrow byte encodings, but I doubt any would be shorter than 2 bytes. You pass a buffer of two bytes to wcstombs, which means a usable ('\0' terminated) string of at most 1 byte. wcstombs stops translating (without failing!) when there's no more room in the destination buffer, and it never writes part of a multibyte character. You've also failed to L'\0' terminate the input buffer. It's not a problem at the moment, because wcstombs will stop translating before it gets there, but you should normally add the extra L'\0'.
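
To make the byte arithmetic above concrete, here is a minimal sketch of encoding a single 16-bit code point as UTF-8 by hand. The function name encode_bmp_utf8 is made up for illustration; it only handles the Basic Multilingual Plane and deliberately returns an empty string for a lone surrogate.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of any library): encode one BMP code point
// (U+0000..U+FFFF) as UTF-8. A lone surrogate (0xD800..0xDFFF) is not a
// valid character on its own, so it yields an empty string.
std::string encode_bmp_utf8(std::uint16_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
        // Lone surrogate: leave the result empty.
    } else {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encode_bmp_utf8(9834) gives the three bytes 0xE2, 0x99, 0xAA, which is why a two-byte destination buffer cannot hold the result.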

So what to do:

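First, and foremost, when debugging this sort of thing, look at the return value of wcstombs. I'll bet that it's 0, because of the lack of space. (If it is (size_t)-1 instead, the character cannot be converted in the current locale. The default "C" locale handles little beyond ASCII, so call setlocale(LC_ALL, "") once at startup.)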

Second, I'd give myself a little bit of margin. Legal Unicode can result in up to four bytes in UTF-8, so I'd allocate at least 5 bytes for the output (don't forget the trailing '\0'). Along the same lines, you need a trailing L'\0' for the input. So:

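std::setlocale( LC_ALL, "" );   //  Once at startup (<clocale>); the default "C" locale can't convert '♪'
char buffer[ 5 ] = {};
wchar_t wc[] = { static_cast<wchar_t>( page->text[index].unicode ), L'\0' };
std::size_t ret = std::wcstombs( buffer, wc, sizeof( buffer ) - 1 );
if ( ret == static_cast<std::size_t>( -1 ) || ret == 0 ) {    //  Error, or nothing converted
    std::cerr << "OOPS\n";
} else {
    std::string str( buffer, buffer + ret );
    std::cout << str << '\n';
}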

Of course, after all that, there is still the question of what the (final) display device does with UTF-8, or whatever the multi-byte narrow character encoding is. UTF-8 is almost universal under Unix, but I'm not sure about Windows. But since you say that displaying "\u9834" seems to work, it should be all right.
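
The wcstombs approach above can be wrapped in a small function. This is only a sketch: the name wide_char_to_multibyte is hypothetical, and sizing the buffer with MB_LEN_MAX (rather than a hard-coded 5) is my own choice so that it does not assume UTF-8.

```cpp
#include <climits>   // MB_LEN_MAX
#include <cstdlib>   // std::wcstombs, std::size_t
#include <string>

// Hypothetical helper: convert one wide character to the current locale's
// multibyte encoding. Returns an empty string if the character cannot be
// represented in that locale.
std::string wide_char_to_multibyte(wchar_t c) {
    const wchar_t in[] = { c, L'\0' };   // the input must be L'\0' terminated
    char out[MB_LEN_MAX + 1] = {};       // room for any one character plus '\0'
    const std::size_t n = std::wcstombs(out, in, sizeof(out) - 1);
    if (n == static_cast<std::size_t>(-1)) {
        return std::string();            // conversion error
    }
    return std::string(out, n);
}
```

The result depends on the active locale. After std::setlocale(LC_ALL, "") (from <clocale>) on a system whose environment selects a UTF-8 locale, passing the value 0x266A should produce the three UTF-8 bytes for '♪'; in the default "C" locale the same call will typically fail and return an empty string.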

OTHER TIPS

Please read a bit about what "character encoding" means, for example the article "What is character encoding and why should I bother with it".

Then figure out what encoding you are getting in, and what encoding you need to use on the output. That means figuring out what your file format / GUI library / console is expecting.

Then use something reliable like libiconv to convert between them, instead of the so-implementation-defined-that-it-is-almost-useless wcstombs()+wchar_t.

For example, you might find that your input is UCS-2, and you need to output it as UTF-8. My system has a 32-bit wchar_t, so I wouldn't count on it converting from UCS-2 to UTF-8.
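
As a rough sketch of the iconv route for that UCS-2 to UTF-8 case, assuming a POSIX-style iconv (built into glibc; other platforms may need to link with -liconv, and some older headers declare the input pointer as const char**). The function name ucs2_unit_to_utf8 and the encoding name "UCS-2LE" are my assumptions about what fits your data, not something taken from your code.

```cpp
#include <iconv.h>   // POSIX iconv_open / iconv / iconv_close
#include <cstddef>
#include <string>

// Hypothetical helper: convert one UCS-2 code unit to UTF-8 via iconv.
// Returns an empty string if the converter is unavailable or the unit is
// not convertible (for example a lone surrogate).
std::string ucs2_unit_to_utf8(char16_t unit) {
    // Serialize explicitly as little-endian so host byte order does not matter.
    char in[2] = { static_cast<char>(unit & 0xFF), static_cast<char>(unit >> 8) };
    char out[8] = {};

    iconv_t cd = iconv_open("UTF-8", "UCS-2LE");   // arguments are (to, from)
    if (cd == (iconv_t)(-1)) {
        return std::string();
    }

    char* in_ptr = in;
    char* out_ptr = out;
    std::size_t in_left = sizeof(in);
    std::size_t out_left = sizeof(out);
    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == static_cast<std::size_t>(-1)) {
        return std::string();
    }
    return std::string(out, sizeof(out) - out_left);
}
```

Unlike wcstombs, this does not depend on the current locale or on the width of wchar_t: both encodings are named explicitly. For real text you would convert a whole buffer per call rather than one unit at a time.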

To convert from UTF-16 to UTF-8, use codecvt_utf8_utf16<char16_t> (note that <codecvt> and std::wstring_convert are deprecated since C++17):

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

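int main() {
    // U+266A ('♪') as a null-terminated UTF-16 string
    char16_t wstr16[2] = {0x266A, 0};
    // Declared directly: wstring_convert is not copyable, so "auto conv = ...{}" may not compile before C++17
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::string u8str = conv.to_bytes(wstr16);
    std::cout << u8str << '\n';
}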
Licensed under: CC-BY-SA with attribution