UTF-16 codecvt facet
Question
Extending from this question about locales, and as described in this question: what I really wanted to do was install a codecvt facet into the locale that understands UTF-16 files.
I could write my own, but I am not a UTF expert, and as such I am sure I would get it only nearly correct; it would break at the most inconvenient time. So I was wondering if there are any resources (on the web) of pre-built codecvt (or other) facets, usable from C++, that are peer reviewed and tested?
The reason is that the default locale (on my system, Mac OS X 10.6) just converts 1 byte to 1 wchar_t with no real conversion when reading a file. Thus UTF-16 encoded files are converted into wstrings that contain lots of null ('\0') characters.
Solution
I'm not sure if by "resources on the Web" you meant available free of cost, but there is the Dinkumware Conversions Library, which sounds like it will fit your needs, provided that the library can be integrated into your compiler suite. The codecvt types are described in the section Code Conversions.
OTHER TIPS
As of C++11, there are additional standard codecvt specialisations and types, intended for converting between various UTF-x and UCS-x character sequences; one of these may suit your needs.

In <locale>:

- std::codecvt<char16_t, char, std::mbstate_t>: converts between UTF-16 and UTF-8.
- std::codecvt<char32_t, char, std::mbstate_t>: converts between UTF-32 and UTF-8.
In <codecvt>:

- std::codecvt_utf8_utf16<Elem>: converts between UTF-8 and UTF-16, where the UTF-16 code units are stored in the specified Elem (note that if char32_t is specified, only one code unit will be stored per char32_t). Has two additional, defaulted template parameters (unsigned long MaxCode = 0x10ffff and std::codecvt_mode Mode = (std::codecvt_mode)0), and inherits from std::codecvt<Elem, char, std::mbstate_t>.
- std::codecvt_utf8<Elem>: converts between UTF-8 and either UCS2 or UCS4, depending on Elem (UCS2 for char16_t, UCS4 for char32_t, platform-dependent for wchar_t). Has the same two additional template parameters and the same base class as above.
- std::codecvt_utf16<Elem>: converts between UTF-16 and either UCS2 or UCS4, depending on Elem (UCS2 for char16_t, UCS4 for char32_t, platform-dependent for wchar_t). Has the same two additional template parameters and the same base class as above.
codecvt_utf8 and codecvt_utf16 will convert between the specified UTF and either UCS2 or UCS4, depending on the size of Elem. Therefore, wchar_t will specify UCS2 on systems where it's 16- to 31-bit (such as Windows, where it's 16-bit), or UCS4 on systems where it's at least 32-bit (such as Linux, where it's 32-bit), regardless of whether wchar_t strings actually use that encoding; on platforms that use different encodings for wchar_t strings, this will understandably cause problems if you aren't careful.
For more information, see cppreference.com.
Note that support for the header codecvt was only added to libstdc++ relatively recently; if using an older version of Clang or GCC, you may have to use libc++ if you want to use it.
Note that versions of Visual Studio prior to 2015 don't actually support char16_t and char32_t; if these types exist on earlier versions, it is as typedefs for unsigned short and unsigned int, respectively. Also note that older versions of Visual Studio can sometimes have trouble converting strings between UTF encodings, and that Visual Studio 2015 has a glitch that prevents codecvt from working properly with char16_t and char32_t, requiring the use of same-sized integral types instead.