Question

I'd like to read some text from a file that uses UTF-8 encoding and convert it to UTF-16, using std::wifstream, something like this:

//
// Read UTF-8 text and convert to UTF-16
//
std::wifstream src;
src.imbue(std::locale("???"));          // UTF-8 ???
src.open("some_text_file_using_utf8");
std::wstring line;                      // UTF-16 string
while (std::getline(src, line))
{
    ... do something processing the UTF-16 string ...
}

Is there a standard locale name for the UTF-8 conversion?
Is it possible to achieve that goal using std::locale?

I'm using Visual Studio 2013.


NOTE:

I know that I/O streams tend to be slow, and that it's possible to use Win32 memory-mapped files for faster reading and the MultiByteToWideChar() Win32 API for the conversion, etc.
But for this particular case I'd like a solution that only uses standard C++ and its standard library, without Boost.

If the C++ standard library just can't do that, the second option would be to use Boost; in this case, which Boost library should I use?

Solution

This works on Windows with Visual Studio, I believe as far back as VS2010:

#include <locale>  // consume_header, locale
#include <codecvt> // codecvt_utf8_utf16

src.imbue(std::locale(
    src.getloc(),
    new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));

Since Windows uses a 16-bit wchar_t and universally uses UTF-16 as its wide-character encoding, this works well in that environment. (And because I'm assuming a Windows environment, the example uses std::consume_header to skip the BOM that Windows tools conventionally prepend to UTF-8 data.)

On other platforms wchar_t is generally 32 bits wide and, while you can store UTF-16 code unit values in such 32-bit code units, nothing else will expect wide strings encoded that way. On a platform with a 32-bit wchar_t you would more likely use std::codecvt_utf8<wchar_t>, which produces UTF-32 wide strings.
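
For instance, on such a platform the imbue call would just use std::codecvt_utf8 instead (a minimal sketch; everything else from the question's code stays the same):

// UTF-8 -> UTF-32 on a platform where wchar_t is 32 bits wide
src.imbue(std::locale(
    src.getloc(),
    new std::codecvt_utf8<wchar_t, 0x10FFFF, std::consume_header>));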


For portability, what you'd ideally want is a codecvt facet that knows how to convert from UTF-8 to the locale's wchar_t encoding or to the wide execution encoding. The problem, however, is that nothing requires any wide encoding to support the entire range of characters representable in UTF-8. The bottom line is that wchar_t, as specified, isn't particularly useful for portable code.

However, one trick that might be useful, if you're sticking to platforms that use UTF-16 or UTF-32 depending on the size of wchar_t, is:

#include <climits> // CHAR_BIT

// Pick the facet based on the width of wchar_t in bits.
template <int N> struct get_codecvt_utf8_wchar_impl;
template <> struct get_codecvt_utf8_wchar_impl<16> {
  using type = std::codecvt_utf8_utf16<wchar_t>; // UTF-8 <-> UTF-16
};
template <> struct get_codecvt_utf8_wchar_impl<32> {
  using type = std::codecvt_utf8<wchar_t>;       // UTF-8 <-> UTF-32
};

using codecvt_utf8_wchar = get_codecvt_utf8_wchar_impl<
    sizeof(wchar_t) * CHAR_BIT>::type;

src.imbue(std::locale(src.getloc(), new codecvt_utf8_wchar));
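
As a sketch of how this plugs back into the question's read loop (the file name is made up, as in the question, and codecvt_utf8_wchar is the alias defined above):

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// ... codecvt_utf8_wchar defined as above ...

int main()
{
    std::wifstream src;
    src.imbue(std::locale(src.getloc(), new codecvt_utf8_wchar));
    src.open("some_text_file_using_utf8"); // hypothetical file name
    std::wstring line;
    while (std::getline(src, line))
    {
        // line holds UTF-16 on Windows, UTF-32 on typical Unix platforms
    }
}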

You can also use char16_t and char32_t, which would lend themselves to portable code; however, the standard is missing a few pieces needed to make iostreams usable with these character types, and implementations don't fully support even what is specified.

Visual Studio, I believe, still implements char16_t and char32_t as typedefs, so the template specializations using them don't work (the specializations do exist if you look in the headers, but they're ifdef'd out because the compiler can't handle them). libstdc++ doesn't implement those template specializations yet, even though it supports char16_t and char32_t as real types. The most complete implementation I know of is libc++ with a suitable compiler (gcc or clang), but even that is still missing the <cuchar> header.

Since implementation support is limited, this largely prevents portable code from doing much with these types beyond using them as a consistent representation in user code across platforms (though that is useful on its own).
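
As one example of that "consistent representation" use, in-memory conversion with std::wstring_convert works on implementations that provide std::codecvt_utf8_utf16<char16_t>; this is a sketch under that assumption (the helper name is made up, and as noted above, support for char16_t here varied between implementations at the time):

#include <codecvt> // std::codecvt_utf8_utf16
#include <locale>  // std::wstring_convert
#include <string>  // std::string, std::u16string

// Hypothetical helper: convert a UTF-8 byte string to a UTF-16 string.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}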

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow