UTF8ToUTF16 failing

https://stackoverflow.com/questions/21662435

09-10-2022

Question

I have the following code, which is three pairs of functions for converting UTF-8 to UTF-16 and vice versa. Each pair uses a different technique.

However, all of them fail:

std::ostream& operator << (std::ostream& os, const std::string &data)
{
    SetConsoleOutputCP(CP_UTF8);
    DWORD slen = data.size();
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), data.size(), &slen, nullptr);
    return os;
}

std::wostream& operator <<(std::wostream& os, const std::wstring &data)
{
    DWORD slen = data.size();
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), slen, &slen, nullptr);
    return os;
}

std::wstring AUTF8ToUTF16(const std::string &data)
{
    return std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(data);
}

std::string AUTF16ToUTF8(const std::wstring &data)
{
    return std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(data);
}

std::wstring BUTF8ToUTF16(const std::string& utf8)
{
    std::wstring utf16;
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
    if (len > 1)
    {
        utf16.resize(len - 1);
        wchar_t* ptr = &utf16[0];
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, ptr, len);
    }
    return utf16;
}

std::string BUTF16ToUTF8(const std::wstring& utf16)
{
    std::string utf8;
    int len = WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), -1, NULL, 0, 0, 0);
    if (len > 1)
    {
        utf8.resize(len - 1);
        char* ptr = &utf8[0];
        WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), -1, ptr, len, 0, 0);
    }
    return utf8;
}

std::string CUTF16ToUTF8(const std::wstring &data)
{
    std::string result;
    result.resize(std::wcstombs(nullptr, &data[0], data.size()));
    std::wcstombs(&result[0], &data[0], data.size());
    return result;
}

std::wstring CUTF8ToUTF16(const std::string &data)
{
    std::wstring result;
    result.resize(std::mbstowcs(nullptr, &data[0], data.size()));
    std::mbstowcs(&result[0], &data[0], data.size());
    return result;
}

int main()
{
    std::string str = "консоли";

    MessageBoxA(nullptr, str.c_str(), str.c_str(), 0); //Works Fine!

    std::wstring wstr = AUTF8ToUTF16(str);  //Crash!
    MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); //Fail - Crash + Display nothing..

    wstr = BUTF8ToUTF16(str);
    MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); //Fail - Random chars..

    wstr = CUTF8ToUTF16(str);
    MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); //Fail - Question marks..

    std::cin.get();
}

The only thing that works above is MessageBoxA. I don't understand why: I'm told that Windows converts everything to UTF-16 anyway, so why can't I convert it myself? Why do none of my conversions work?

Is there a reason my code does not work?

Solution

The root reason all of your approaches fail is that they require the std::string to be UTF-8 encoded, but std::string str = "консоли" is UTF-8 encoded only if you save the .cpp file as UTF-8 and configure your compiler's default codepage to UTF-8. In most C++11 compilers, you can use the u8 prefix to force the literal to be UTF-8 encoded:

std::string str = u8"консоли";

However, VS 2013 does not support that feature yet:

Support For C++11 Features

Feature                   VS 2010   VS 2012   VS 2013
Unicode string literals   No        No        No
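
When the u8 prefix is unavailable, one way to see what the compiler actually stored is to dump the literal's bytes. The sketch below is only illustrative: HexDump and kConsoleUtf8 are hypothetical names, not part of the question's code. It also shows the word written with hex escapes, which produces the same UTF-8 bytes regardless of how the .cpp file is saved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "консоли" written as explicit UTF-8 bytes, so the literal does not
// depend on the source file encoding or the compiler's codepage.
const char kConsoleUtf8[] =
    "\xD0\xBA\xD0\xBE\xD0\xBD\xD1\x81\xD0\xBE\xD0\xBB\xD0\xB8";

// Hypothetical diagnostic helper: renders each byte of `s` as two
// uppercase hex digits, separated by single spaces.
std::string HexDump(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) {
            out += ' ';
        }
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

If HexDump(str) for the question's literal prints D0 BA D0 BE D0 BD D1 81 D0 BE D0 BB D0 B8, the string is UTF-8. If it prints EA EE ED F1 EE EB E8, the compiler stored it as Windows-1251, and every UTF-8 conversion will see invalid input.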

Windows itself does not support UTF-8 in most API functions that take a char* as input (an exception is MultiByteToWideChar() when using CP_UTF8). When you call an A function, it calls the corresponding W function internally, converting any char* data to/from UTF-16 using Windows' default codepage (CP_ACP). So you get unpredictable results when you pass non-CP_ACP data to functions that expect CP_ACP. As such, MessageBoxA() works correctly only if your .cpp file and compiler use the same codepage as CP_ACP, so that the unprefixed char* data matches what MessageBoxA() is expecting.

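I don't know for certain why AUTF8ToUTF16() is crashing. One likely cause is that std::wstring_convert::from_bytes() throws std::range_error when the input cannot be converted and no error string was supplied, so non-UTF-8 bytes end in an uncaught exception. A bug in your compiler's STL implementation when processing bad data is also possible.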
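
To check whether a conversion failure, rather than a library bug, is behind the crash, the call can be wrapped in a try/catch. This is a minimal sketch, assuming the failure surfaces as std::range_error. TryUTF8ToWide is a hypothetical name. Note that <codecvt> is deprecated since C++17, so newer compilers may warn about it.

```cpp
#include <cassert>
#include <codecvt>    // deprecated since C++17
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical wrapper around the same conversion AUTF8ToUTF16 performs.
// Returns false instead of letting the exception escape when `data`
// is not valid UTF-8.
bool TryUTF8ToWide(const std::string& data, std::wstring& out)
{
    try {
        out = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(data);
        return true;
    } catch (const std::range_error&) {
        // from_bytes reports a failed conversion this way when the
        // converter was constructed without an error string.
        out.clear();
        return false;
    }
}
```

If this returns false for the question's str, the literal was not stored as UTF-8 and the crash was simply the unhandled exception.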

BUTF8ToUTF16() does not handle this case from the MultiByteToWideChar() documentation: "If the input byte/char sequences are invalid, returns U+FFFD for UTF encodings." Also, your implementation is not optimal. Pass the string's length() instead of -1 for the input size, so you avoid dealing with null terminator issues.
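
A possible shape for that advice is sketched below. It passes the explicit length and also sets MB_ERR_INVALID_CHARS, so invalid UTF-8 is reported as a failure instead of being silently replaced. StrictUTF8ToUTF16 is a hypothetical name, and the _WIN32 guard exists only so the snippet compiles to nothing on other platforms.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <cassert>
#include <climits>
#include <string>

// Hypothetical variant of BUTF8ToUTF16: explicit input length, strict
// validation. Returns false if the input is not valid UTF-8.
bool StrictUTF8ToUTF16(const std::string& utf8, std::wstring& utf16)
{
    utf16.clear();
    if (utf8.empty()) {
        return true;  // the API treats a zero input length as an error
    }
    assert(utf8.size() <= static_cast<std::size_t>(INT_MAX));
    const int srcLen = static_cast<int>(utf8.size());

    // First call: ask how many UTF-16 code units are needed.
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), srcLen, nullptr, 0);
    if (len <= 0) {
        return false;  // invalid sequence or other API failure
    }

    // Second call: convert into the buffer. No terminator is written,
    // because the input length was given explicitly.
    utf16.resize(len);
    len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                              utf8.data(), srcLen, &utf16[0], len);
    if (len <= 0) {
        utf16.clear();
        return false;
    }
    return true;
}
#endif
```

Because the length is explicit, there is no len - 1 adjustment and embedded null characters survive the conversion.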

CUTF8ToUTF16() does no error handling or validation. Note also that mbstowcs() decodes using the current C locale's encoding, which is not necessarily UTF-8. That said, converting invalid input to question marks or U+FFFD is very common in most libraries.
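
To make the replacement behavior concrete, here is a portable sketch of a lossy decoder that emits U+FFFD for input it cannot decode. DecodeUTF8Lossy is a hypothetical helper, not a Windows or standard library function. It uses one simple policy, one U+FFFD per offending byte, and real libraries may group bad bytes differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-16, replacing every byte
// that cannot start or complete a valid sequence with U+FFFD.
std::u16string DecodeUTF8Lossy(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;       // code point being assembled
        char32_t min = 0;      // smallest value allowed for this length
        std::size_t extra = 0; // number of continuation bytes expected
        bool ok = true;

        if (c < 0x80)                { cp = c; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; min = 0x10000; }
        else                         { ok = false; } // stray or invalid lead byte

        if (ok && extra > in.size() - i - 1) {
            ok = false; // sequence is cut off by the end of the input
        }
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(in[i + k]);
            if ((cc & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (cc & 0x3F);
        }
        // Reject overlong forms, encoded surrogates, and out-of-range values.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }

        if (!ok) {
            out.push_back(static_cast<char16_t>(0xFFFD));
            ++i; // skip only the offending byte, then resynchronize
            continue;
        }
        if (cp >= 0x10000) { // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
        i += extra + 1;
    }
    return out;
}
```

Feeding it the Windows-1251 bytes of "консоли" (EA EE ED F1 EE EB E8) yields seven U+FFFD characters. The output contains no trace of the original text, which matches the kind of garbage the question describes.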

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow