Converting a UTF-8 text to wchar_t

Question 1

C does not define what encoding the char and wchar_t types are and the standard library only mandates some functions that translate between the two without saying how. If the implementation-dependent encoding of char is not UTF-8 then mbstowcs will result in data corruption.

As noted in the rationale for the C99 standard:

However, the five functions are often too restrictive and too primitive to develop portable international programs that manage characters.

...

C90 deliberately chose not to invent a more complete multibyte- and wide-character library, choosing instead to await their natural development as the C community acquired more experience with wide characters.

Sourced from here.

So, if you have UTF-8 data in your chars there isn't a standard API way to convert that to wchar_ts.

In my opinion wchar_t should usually be avoided unless necessary - you might need it if you're using WIN32 APIs for example. I am not convinced it will simplify string manipulation. wchar_t is always UTF-16LE on Windows so you may still need to have more than one wchar_t to represent a single Unicode code point anyway.

I suggest you investigate the ICU project - at least from an educational standpoint.

Question 2

Also, i would be compiling this code on another platform, which the sizeof of wchar_t is different (2 bytes versus 4 bytes on my machine). How can i overcome that? using fixed-size char containers?

You could do that with conditional typedefs like this:

#if defined(__STDC_UTF_16__)
   typedef _Char16_t CHAR16;
#elif defined(_WIN32)
   typedef wchar_t   CHAR16;
#else
   typedef uint16_t  CHAR16;
#endif

#if defined(__STDC_UTF_32__)
   typedef _Char32_t CHAR32;
#elif defined(__STDC_ISO_10646__)
   typedef wchar_t   CHAR32;
#else
   typedef uint32_t  CHAR32;
#endif

This will define the typedefs CHAR16 and CHAR32 to use the new C++11 character types if available, but otherwise fall back to using wchar_t when possible and fixed-width unsigned integers otherwise.