Standard way in C11 and C++11 to convert UTF-8?

https://stackoverflow.com/questions/19649192

01-07-2022
|

Frage

C11 and C++11 both introduce the uchar.h/cuchar header defining char16_t and char32_t as explicitly 16 and 32 bit wide characters, added literal syntax u"" and U"" for writing strings with these character types, along with macros __STDC_UTF_16__ and __STDC_UTF_32__ that tell you whether or not they correspond to UTF-16 and UTF-32 code units. This helps remove the ambiguity about wchar_t, which on some platforms was 16 bit and generally used to hold UTF-16 code units, and on some platforms was 32 bit and generally used to hold UTF-32 code units; assuming those macros are now set, you can now write portable, unambiguous code referring to UTF-16 and UTF-32. __STDC_ISO_10646__ can also be used as a proxy to determine whether wchar_t is capable of holding UTF-32 values; if it can't, you can't necessarily assume that it holds UTF-16, but it's probably a close enough approximation to be portable.

They also add the functions mbrtoc16, mbrtoc32, c16rtomb, and c32rtomb for converting between multibyte characters and these types. Between these and the existing mbstowcs family of functions, it's possible to translate between UTF-16, UTF-32, the platform multibyte character set, and the platform wide character set portably (though not necessarily losslessly unless the platform defined multibyte and wide character sets are UTFs; in particular, it seems like these functions will be fairly useless on Windows where the locale defined multibyte encoding is not allowed to use more than two bytes per character).

Furthermore, they added the u8"" syntax for writing literal UTF-8 encoded strings. As UTF-8 is an encoding that is compatible with most functions that deal in char * and std::string, this is one of the most useful new additions.

However, they seem to have failed to add any way to portably convert between UTF-8, UTF-16, and UTF-32. The mbtoc16 and related functions convert between the implementation defined multibyte encoding and UTF-16 or 32; but you can't depend on this being UTF-8. On Unix-like platforms it's dependent on the locale, and many of them use UTF-8 in their locale by default, and even if it's not the default you can at least set the locale to a UTF-8 locale for the purposes of knowing that "multibyte" means UTF-8. On Windows, however, you explicitly can't use UTF-8 or any other encoding that requires more than two bytes for the locale.

Am I just missing something, or is the UTF-8 string type not accompanied by any way to convert it to the other types of strings: platform defined multibyte, platform defined wide char, UTF-16, or UTF-32? Is there no way to even tell if your system multibyte encoding is UTF-8? Is there any reason why this support wasn't included (specifically, I'm looking for actually written justification or discussion by the C or C++ standards committees, not just speculation)? Is there any work being done to improve this situation; is it likely to improve in the future?

Or, is the current best solution, if you want to support UTF-8 in a portable fashion, to write your own implementation, pull in a library dependency, or use platform-specific functions like iconv and MultiByteToWideChar?

Lösung

Sounds like you're looking for the std::codecvt type. See the example on that page for usage.

Lizenziert unter: CC-BY-SA mit Zuschreibung

Nicht verbunden mit StackOverflow