Unicode literals in Visual C++

https://stackoverflow.com//questions/25072236

26-12-2019

Question

Consider the following code:

#include <string>
#include <fstream>
#include <iomanip>

int main() {
    std::string s = "\xe2\x82\xac\u20ac";
    std::ofstream out("test.txt");
    out << s.length() << ":" << s << std::endl;
    out << std::endl;
    out.close();
}

Under GCC 4.8 on Linux (Ubuntu 14.04), the file test.txt contains this:

6:€€

Under Visual C++ 2013 on Windows, it contains this:

4:€\x80

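(By '\x80' I mean the single 8-bit character 0x80.)

I have been completely unable to get either compiler to output a € character using std::wstring.

Two questions:

What exactly does the Microsoft compiler think it is doing with the char* literal? It is obviously doing something to encode it, but it is not clear what.
What is the right way to rewrite the above code using std::wstring and std::wofstream so that it outputs two € characters?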

Solution

This happens because you are using \u20ac, which is a Unicode character literal, inside a narrow (ASCII) string.

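MSVC encodes "\xe2\x82\xac\u20ac" as 0xe2, 0x82, 0xac, 0x80, which is 4 narrow characters. It essentially encodes \u20ac as 0x80 because it maps the euro character to the standard Windows-1252 code page, where € is the single byte 0x80.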

GCC converts the Unicode literal \u20ac to the 3-byte UTF-8 sequence 0xe2, 0x82, 0xac, so the resulting string ends up as 0xe2, 0x82, 0xac, 0xe2, 0x82, 0xac.
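To check the byte values quoted above, here is a minimal sketch that encodes a code point as UTF-8 by hand and dumps a narrow string as hex. The helper names encode_utf8 and hex_dump are made up for this illustration, and the encoder does not validate its input (surrogates are not rejected).

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: encode one Unicode code point (up to U+10FFFF) as UTF-8.
// No validation is done; this is only a sketch of the bit layout.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: render each byte of a narrow string as two hex digits.
std::string hex_dump(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", c);
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

hex_dump(encode_utf8(0x20AC)) gives "e2 82 ac", the same three bytes as the hand-written "\xe2\x82\xac". Running hex_dump on the string s from the question is a quick way to see which encoding a given compiler chose for \u20ac in a narrow literal.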

If you use std::wstring = L"\xe2\x82\xac\u20ac", MSVC encodes it as 0xe2, 0x00, 0x82, 0x00, 0xac, 0x00, 0xac, 0x20, which is 4 wide characters, but because you are mixing hand-written UTF-8 with UTF-16, the resulting string does not make much sense. If you use std::wstring = L"\u20ac\u20ac", you get 2 Unicode characters in a wide string, as you would expect.
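The difference between the two wide literals is easier to see by dumping their code units. The sketch below does this; hex_units, mixed_literal and euro_literal are names invented for this example. Note that wchar_t is 16 bits wide on Windows and typically 32 bits on Linux, but for these particular characters the code unit values are the same on both.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each wchar_t code unit as at least four hex digits.
std::string hex_units(const std::wstring& s) {
    std::string out;
    char buf[20];
    for (wchar_t c : s) {
        std::snprintf(buf, sizeof buf, "%04lx", static_cast<unsigned long>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// Each \x escape in a wide literal becomes its own wide character, so the
// hand-written UTF-8 bytes turn into three unrelated characters plus one euro.
std::wstring mixed_literal() { return L"\xe2\x82\xac\u20ac"; }

// Two universal character names give two euro characters.
std::wstring euro_literal() { return L"\u20ac\u20ac"; }
```

hex_units(mixed_literal()) yields "00e2 0082 00ac 20ac" (four characters), while hex_units(euro_literal()) yields "20ac 20ac" (two characters).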

The next problem is that MSVC's ofstream and wofstream always write in ANSI/ASCII. To make them write UTF-8, you should use <codecvt> (VS 2010 or later):

#include <string>
#include <fstream>
#include <iomanip>
#include <codecvt>

int main()
{
    std::wstring s = L"\u20ac\u20ac";

    std::wofstream out("test.txt");
    std::locale loc(std::locale::classic(), new std::codecvt_utf8<wchar_t>);
    out.imbue(loc);

    out << s.length() << L":" << s << std::endl;
    out << std::endl;
    out.close();
}

And to write UTF-16 (more specifically, UTF-16LE):

#include <string>
#include <fstream>
#include <iomanip>
#include <codecvt>

int main()
{
    std::wstring s = L"\u20ac\u20ac";

    std::wofstream out("test.txt", std::ios::binary );
    std::locale loc(std::locale::classic(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>);
    out.imbue(loc);

    out << s.length() << L":" << s << L"\r\n";
    out << L"\r\n";
    out.close();
}

Note: with UTF-16 you have to use binary mode rather than text mode to avoid corruption, so you cannot use std::endl and have to write L"\r\n" yourself to get the correct end-of-line behavior for a text file.
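As a rough illustration of why text mode can corrupt UTF-16 output, the sketch below serializes UTF-16 code units to little-endian bytes by hand and checks for a raw 0x0A byte, which Windows text mode would expand to 0x0D 0x0A. Both function names are hypothetical, and this is not what the codecvt facet does internally, just the same byte layout written out manually.

```cpp
#include <string>

// Hypothetical helper: serialize UTF-16 code units as little-endian bytes
// (low byte first, then high byte).
std::string to_utf16le_bytes(const std::u16string& s) {
    std::string out;
    for (char16_t u : s) {
        out += static_cast<char>(u & 0xFF);
        out += static_cast<char>((u >> 8) & 0xFF);
    }
    return out;
}

// Hypothetical helper: true if the byte stream contains a raw 0x0A byte,
// the byte that text-mode newline translation would rewrite.
bool contains_raw_newline_byte(const std::string& bytes) {
    return bytes.find('\x0a') != std::string::npos;
}
```

For example, U+20AC serializes to the bytes 0xac, 0x20, which is safe, but a character such as U+010A serializes to 0x0a, 0x01. In text mode the stray 0x0a would gain an extra 0x0d in front of it and shift every following code unit by one byte.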

License: CC-BY-SA with attribution

Not affiliated with StackOverflow