Question

I have the lovely functions from my previous question, which work fine if I do this:

wstring temp;
wcin >> temp;

string whatever( toUTF8(getSomeWString()) );

// store whatever, copy, but do not use it as UTF8 (see below)

wcout << toUTF16(whatever) << endl;

The original form is reproduced correctly, but the in-between form often contains extra characters. If I enter, for example, àçé as the input and add a cout << whatever statement, I'll get ┬à┬ç┬é as output.

Can I still use this string to compare to others procured from an ASCII source? Or, asked differently: if I output ┬à┬ç┬é through cout on a UTF-8 Linux terminal, would it read àçé? Is the byte content of the string àçé, read by cin on a UTF-8 Linux system, exactly the same as what the Win32 API gives me?

Thanks!

PS: the reason I'm asking is that I need to use the string a lot to compare to other read values (comparing and concatenating...).
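To give an idea of the usage, here is a rough sketch of the kind of comparison I mean (toUTF8 and getSomeWString are the helpers from the previous question; readLineFromUtf8Source is a hypothetical stand-in for wherever the other value comes from):

string fromApi( toUTF8(getSomeWString()) );   // UTF-8 bytes from the Win32 conversion
string fromOther = readLineFromUtf8Source();  // hypothetical: UTF-8 bytes read elsewhere

if (fromApi == fromOther) {
    // byte-wise equality is what I'm hoping to rely on
}
string combined = fromApi + " / " + fromOther; // concatenation is byte-wise as well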


Solution

Let me start by saying that there appears to be simply no way to output UTF-8 text to the console in Windows via cout (assuming you compile with Visual Studio). What you can do for your tests, however, is output your UTF-8 text via the Win32 API function WriteConsoleA:

#include <windows.h>
#include <iostream>
#include <cstring>
using namespace std;

int main() {
    if(!SetConsoleOutputCP(CP_UTF8)) { // 65001
        cerr << "Failed to set console output mode!\n";
        return 1;
    }
    HANDLE const consout = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD nNumberOfCharsWritten;
    const char* utf8 = "Umlaut AE = \xC3\x84 / ue = \xC3\xBC \n";
    if(!WriteConsoleA(consout, utf8, strlen(utf8), &nNumberOfCharsWritten, NULL)) {
        DWORD const err = GetLastError();
        cerr << "WriteConsole failed with " << err << "!\n";
        return 1;
    }
    return 0;
}

This should output: Umlaut AE = Ä / ue = ü if you set your console (cmd.exe) to use the Lucida Console font.

As for your question (taken from your comment) if

a Win32 API converted string is the same as a raw UTF-8 (Linux) string

I will say yes: given a Unicode character sequence, its UTF-16 (Windows wchar_t) representation converted to a UTF-8 (char) representation via the WideCharToMultiByte function will always yield the same byte sequence.
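To make that concrete, here is a minimal sketch of such a conversion (roughly what a toUTF8 helper like the one from the previous question might look like; error handling omitted):

#include <windows.h>
#include <string>

// Sketch only: UTF-16 (wchar_t) to UTF-8 (char) via WideCharToMultiByte.
std::string toUTF8(const std::wstring& wide)
{
    if (wide.empty()) return std::string();
    // First call asks for the required buffer size in bytes.
    int bytes = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                                    NULL, 0, NULL, NULL);
    std::string utf8(bytes, '\0');
    // Second call performs the actual conversion.
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                        &utf8[0], bytes, NULL, NULL);
    return utf8;
}

For L"àçé" this yields the bytes C3 A0 C3 A7 C3 A9, which is exactly what a UTF-8 Linux terminal delivers for the same characters.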

Other tips

When you convert the string to UTF-16, each code unit is a 16-bit wide character; you can't compare it directly to ASCII values because those aren't 16-bit values. You have to convert one side before comparing, or write a specialized comparison-to-ASCII function.
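A minimal sketch of such a specialized comparison, assuming the other string really is 7-bit ASCII:

#include <string>

// For 7-bit ASCII the Unicode code point equals the byte value, so each
// char can be compared to the corresponding UTF-16 code unit directly.
bool equalsAscii(const std::wstring& utf16, const std::string& ascii)
{
    if (utf16.size() != ascii.size()) return false;
    for (std::size_t i = 0; i < ascii.size(); ++i)
        if (utf16[i] != static_cast<wchar_t>(static_cast<unsigned char>(ascii[i])))
            return false;
    return true;
}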

I doubt the UTF-8 cout on Linux would produce the same correct output unless the values were plain ASCII, as the UTF-8 encoding form is binary-compatible with ASCII only for code points below 128, and I assume UTF-16 relates to UTF-8 in a similar fashion.
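To make the 128-code-point boundary concrete:

const char ascii_abc[]    = "abc";      // 61 62 63: identical bytes in ASCII and UTF-8
const char utf8_a_grave[] = "\xC3\xA0"; // 'à' takes two bytes in UTF-8 and is not representable in 7-bit ASCII at all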

The good news is that there are many converters out there written to convert strings between different character sets.
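For instance, if a C++11 standard library is available, std::wstring_convert can do the UTF-8/UTF-16 conversion without the Win32 API (note that it was deprecated in C++17); a minimal sketch:

#include <codecvt>
#include <locale>
#include <string>

// wchar_t holds UTF-16 code units on Windows, so codecvt_utf8_utf16 applies.
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conv;

std::string  utf8  = conv.to_bytes(L"\u00E0\u00E7\u00E9"); // "àçé" as UTF-8 bytes
std::wstring utf16 = conv.from_bytes(utf8);                // and back to UTF-16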
