Unicode literals in Visual C++

https://stackoverflow.com//questions/25072236

26-12-2019
Question

Consider the following code:

#include <string>
#include <fstream>
#include <iomanip>

int main() {
    std::string s = "\xe2\x82\xac\u20ac";
    std::ofstream out("test.txt");
    out << s.length() << ":" << s << std::endl;
    out << std::endl;
    out.close();
}

With GCC 4.8 on Linux (Ubuntu 14.04), the file test.txt contains this:

6:€€

With Visual C++ 2013 on Windows, it contains this:

4:€\x80

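(Here "\x80" means the single 8-bit character 0x80.)

I have been completely unable to get either compiler to output a € character using std::wstring.

Two questions:

What exactly does the Microsoft compiler think it is doing with the char* literal? It is obviously encoding it somehow, but it is not clear how.
What is the right way to rewrite the above code using std::wstring and std::wofstream so that it outputs two € characters?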

Solution

This happens because you are using \u20ac, a Unicode character literal, inside a narrow (char) string literal.

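MSVC encodes "\xe2\x82\xac\u20ac" as 0xe2, 0x82, 0xac, 0x80, which is 4 narrow characters. It encodes \u20ac as 0x80 because it maps the euro character to the standard Windows-1252 code page, where the euro sign is the single byte 0x80.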

GCC converts the Unicode literal \u20ac to the 3-byte UTF-8 sequence 0xe2, 0x82, 0xac, so the resulting string ends up as 0xe2, 0x82, 0xac, 0xe2, 0x82, 0xac.
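To make the 3-byte result concrete, here is a minimal sketch of the UTF-8 encoding step for a single code point. The helper name encode_utf8 is hypothetical (it is not a standard library function), and the sketch assumes the input is a valid Unicode scalar value.

```cpp
#include <string>

// Hypothetical helper, not part of any library: encodes one Unicode
// code point as UTF-8. Assumes cp is a valid scalar value
// (at most 0x10FFFF and not a surrogate); no validation is done.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (U+20AC lands here)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encode_utf8(0x20AC) yields the same three bytes that the question spelled out by hand as "\xe2\x82\xac", which is why the GCC build reports a length of 6 for the full literal.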

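If you use std::wstring s = L"\xe2\x82\xac\u20ac", MSVC encodes it as the bytes 0xe2, 0x00, 0x82, 0x00, 0xac, 0x00, 0xac, 0x20, which is 4 wide characters. Because this mixes hand-written UTF-8 bytes with UTF-16, the resulting string does not make much sense. If you use std::wstring s = L"\u20ac\u20ac" instead, you get 2 Unicode characters in the wide string, as you would expect.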
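The difference between the two wide literals can be checked by listing the numeric value of each wchar_t. This is only an illustrative sketch; code_units is a hypothetical helper name, not something from the original answer.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the numeric value of every wchar_t
// element in a wide string, so the literal's contents can be inspected.
std::vector<unsigned long> code_units(const std::wstring& s) {
    std::vector<unsigned long> units;
    for (wchar_t c : s) {
        units.push_back(static_cast<unsigned long>(c));
    }
    return units;
}
```

For L"\u20ac\u20ac" this gives two elements, both 0x20AC. For L"\xe2\x82\xac\u20ac" it gives four elements (0xE2, 0x82, 0xAC, 0x20AC), because each \x escape in a wide literal becomes its own wide character rather than one byte of a UTF-8 sequence.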

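The next problem is that MSVC's ofstream and wofstream always write narrow ANSI/ASCII output, whatever the width of the string you pass in. To make the stream write UTF-8, use <codecvt> (VS 2010 or later):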

#include <string>
#include <fstream>
#include <iomanip>
#include <codecvt>

int main()
{
    std::wstring s = L"\u20ac\u20ac";

    std::wofstream out("test.txt");
    std::locale loc(std::locale::classic(), new std::codecvt_utf8<wchar_t>);
    out.imbue(loc);

    out << s.length() << L":" << s << std::endl;
    out << std::endl;
    out.close();
}

And to write UTF-16 (more specifically, UTF-16LE):

#include <string>
#include <fstream>
#include <iomanip>
#include <codecvt>

int main()
{
    std::wstring s = L"\u20ac\u20ac";

    std::wofstream out("test.txt", std::ios::binary );
    std::locale loc(std::locale::classic(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>);
    out.imbue(loc);

    out << s.length() << L":" << s << L"\r\n";
    out << L"\r\n";
    out.close();
}

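Note: for UTF-16 you must open the file in binary mode rather than text mode to avoid corruption. That means we cannot use std::endl and must write L"\r\n" explicitly to get correct text-file line endings.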
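A rough sketch of why text mode is a problem: the newline translation on Windows works on bytes, so it can insert a 0x0D in front of a 0x0A byte that is only half of a UTF-16 code unit. The two helpers below are hypothetical and only simulate that behaviour in memory; they assume BMP-only input and a translation that simply turns every 0x0A byte into 0x0D 0x0A.

```cpp
#include <string>

// Hypothetical helper: serializes UTF-16 code units as little-endian
// bytes (low byte first), which is the layout UTF-16LE uses on disk.
std::string to_utf16le(const std::u16string& units) {
    std::string bytes;
    for (char16_t u : units) {
        bytes += static_cast<char>(u & 0xFF);
        bytes += static_cast<char>(u >> 8);
    }
    return bytes;
}

// Hypothetical simulation of text-mode output (an assumption about the
// mechanism, not the real runtime code): every 0x0A byte is written
// as the two bytes 0x0D 0x0A.
std::string simulate_text_mode(const std::string& bytes) {
    std::string out;
    for (char b : bytes) {
        if (b == '\n') {
            out += '\r';
        }
        out += b;
    }
    return out;
}
```

Under these assumptions, to_utf16le(u"\u20ac\n") is the 4 bytes AC 20 0A 00, and passing it through simulate_text_mode gives the 5 bytes AC 20 0D 0A 00. An odd byte count can no longer be split into 16-bit units, so everything after that point is misaligned. In binary mode an explicit L"\r\n" is written as 0D 00 0A 00, which is a well-formed UTF-16LE line ending.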

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow