Read/write unicode c++

https://stackoverflow.com/questions/12789273

06-07-2021

Question

Sadly this is the third time this week I have to post a question.

I have to write text to a file with unicode encoding (or UTF8).
This is what I do:

I create a wofstream mystream; and then write a wstring to it like this: mystream << L"hello world";

First question: what encoding does the stream use in my case?

Secondly, I want to load my new file, but how do I read the lines? ifstream's getline does not work because the line obviously ends up garbled.

Solution

wchar_t, the type that backs the wide streams and wstring, is platform dependent: 2 bytes on Windows, 4 bytes on most Linux systems. So you will end up writing 'Unicode', but exactly which Unicode encoding depends on many variables: you may write UTF-32/UCS-4, or you may end up with UTF-16/UCS-2.

If you want to write using a specific, well-controlled encoding (e.g. UTF-8, or UCS-2LE vs. UCS-2BE to control endianness) then you need something like iconv. You can also use std::locale to imbue a stream, see https://stackoverflow.com/a/1275260/105929.

Other tips

By 'unicode encoding' I presume you mean UTF-16. There are actually several encodings that might be called Unicode encodings, but most people that aren't familiar with Unicode take it to mean UTF-16 (I think largely because Microsoft makes this mistake in all their documentation). My answer also assumes you're writing code for Windows and therefore that your internal data is UTF-16 stored in wchar_t strings.

Using a wide stream object does not imply that the file input or output will be done using wide characters. In fact, a wide stream will use a codecvt facet of the stream's locale in order to convert between the stream's character type (wchar_t) and char.

In C++11 there are a few codecvt facets, declared in <codecvt> (deprecated since C++17), that you can use to do UTF-16 or UTF-8 input/output: codecvt_utf8, codecvt_utf16, codecvt_utf8_utf16.

codecvt_utf8 will convert between external UTF-8 multi-byte sequences and internal UTF-32/UCS4 or UCS2 data. codecvt_utf16 will convert between external UTF-16 multi-byte sequences and internal UTF-32/UCS4 or UCS2 data. codecvt_utf8_utf16 will convert between external UTF-8 multi-byte sequences and internal UTF-16 data.

There's no built-in way to convert between external UTF-16 multi-byte sequences and internal UTF-16 data, which is what you'd want when using UTF-16 encoded wchar_t strings internally and UTF-16 encoded files externally.

But since you indicated that UTF-8 output was acceptable the codecvt_utf8_utf16 facet will work well.

#include <codecvt>
#include <fstream>
#include <locale>

int main() {
    std::wofstream mystream("test.txt");
    // Swap in a codecvt facet: UTF-16 in wchar_t internally, UTF-8 with a BOM in the file.
    mystream.imbue(std::locale(std::locale(),
                   new std::codecvt_utf8_utf16<wchar_t, 0x10ffff, std::codecvt_mode(std::consume_header|std::generate_header)>));
    mystream << L"Hello, World!\n";
}

Also note that this example sets options on the codecvt_utf8_utf16 facet to generate and read the so-called 'UTF-8 BOM'. This is a Microsoft convention for guessing at a file's encoding and is generally inappropriate on other platforms.

The following is irrelevant to the question at hand, but lifetime management of facets is not like most other modern C++ lifetime management.

Facets are reference counted and when the last locale that has a particular facet is destroyed, the facet is deleted, unless that has been specifically disabled by constructing the facet with a refs parameter of 1. The above example code leaves lifetime management to the locale, and consequently looks similar to a memory leak. The code is correct, however. In terms of exception safety, the only code that can potentially run between a successful allocation and ownership of the allocated object being assumed by the locale is the expression std::locale() which is declared noexcept.

Another option is to use a facet which is not managed by the locale and to simply ensure that it outlives the locale and all copies. Using a facet with static storage duration is simple, but remember to indicate that locales should not delete the facet by constructing it with a refs argument of 1.

static std::codecvt_utf8_utf16<wchar_t, 0x10ffff, std::codecvt_mode(std::consume_header|std::generate_header)> mycodecvt(1);
mystream.imbue(std::locale(std::locale(), &mycodecvt));  // the locale constructor takes a pointer to the facet

If the locale only exists for a short time in a specific scope then you can use a normal, local variable. This is the same as the above but without static. Just make sure the locale (as well as every copy) is destroyed before the facet goes out of scope.

This is one time where a smart pointer does not make things better, because the hand-off of ownership to a smart pointer-oblivious object is tricky. You have to figure out how to manually handle exceptions that occur after the locale has received the facet and therefore has taken ownership but before the smart pointer relinquishes ownership.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow