C does not define the encoding of the char and wchar_t types, and the standard library only mandates a few functions that translate between the two without saying how. mbstowcs interprets its input according to the multibyte encoding of the current locale, so if that encoding is not UTF-8, passing it UTF-8 data will produce corrupted output or a conversion error.
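To make the locale dependence concrete, here is a minimal sketch. The helper name convert_in_locale is hypothetical (it is not part of any standard API); it only wraps the standard setlocale and mbstowcs calls so the same bytes can be tried under different locales.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert src to wide characters under the named locale.
 * Returns the number of wide characters written (excluding any terminator),
 * or (size_t)-1 if the locale is unavailable or src is invalid in it.
 * Note: this changes the process-wide LC_CTYPE locale and does not restore it. */
size_t convert_in_locale(const char *locale_name, const char *src,
                         wchar_t *dst, size_t dst_len)
{
    assert(locale_name != NULL && src != NULL && dst != NULL);

    /* mbstowcs decodes src using the LC_CTYPE category of the current locale. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;

    return mbstowcs(dst, src, dst_len);
}
```

Calling this with the same UTF-8 bytes under a UTF-8 locale and under a non-UTF-8 locale can give different results. Locale names such as "en_US.UTF-8" are an assumption here: they are platform-specific and may not be installed, in which case setlocale returns NULL.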
As noted in the rationale for the C99 standard:
However, the five functions are often too restrictive and too primitive to develop portable international programs that manage characters.
...
C90 deliberately chose not to invent a more complete multibyte- and wide-character library, choosing instead to await their natural development as the C community acquired more experience with wide characters.
Sourced from here.
So, if you have UTF-8 data in your chars, there is no standard, locale-independent API to convert it to wchar_ts.
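If you do need code points from UTF-8 bytes without depending on the locale, one option is to decode them by hand. The following is a sketch only, not a library API: the name utf8_decode and its return convention are assumptions made for illustration. It decodes a single code point and rejects overlong forms, surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the UTF-8 bytes at s
 * (len bytes available). Stores the code point in *out and returns the
 * number of bytes consumed, or 0 if the input is malformed or truncated. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (len == 0)
        return 0;

    uint32_t cp;
    size_t need; /* number of continuation bytes expected */
    if (s[0] < 0x80)                { cp = s[0];        need = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 3; }
    else return 0; /* stray continuation byte or invalid lead byte */

    if (len < need + 1)
        return 0;
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0; /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Smallest code point that legitimately needs each sequence length. */
    static const uint32_t min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    if (cp < min_cp[need] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return need + 1;
}
```

The result is a plain uint32_t code point, which sidesteps the question of how wide wchar_t is on a given platform.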
In my opinion wchar_t should usually be avoided unless necessary - you might need it if you're using Win32 APIs, for example. I am not convinced it will simplify string manipulation. On Windows wchar_t is 16 bits wide and holds UTF-16 (little-endian), so you may still need more than one wchar_t to represent a single Unicode code point anyway.
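To illustrate why a 16-bit unit is not enough for every code point, here is a small sketch of UTF-16 encoding. The function name utf16_encode is hypothetical, and uint16_t stands in for a 16-bit wchar_t so the example behaves the same on every platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16 into out[0..1].
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is not
 * a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        /* Basic Multilingual Plane: a single unit suffices. */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Supplementary planes: split the 20-bit offset into a surrogate pair. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

For example, U+20AC fits in one unit, while U+1F600 becomes the pair 0xD83D 0xDE00, so code that indexes a UTF-16 string by unit still cannot assume one unit per character.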
I suggest you investigate the ICU project - at least from an educational standpoint.