Sadly, this is a huge topic that deserves a small book devoted to it. And that book would basically need a specialized chapter for every target platform one wished to build for (Linux, Windows [flavor], Mac, etc.).
My answer only covers Windows desktop applications written in C++, with or without MFC. Please note: it pertains to reading and writing MBCS (narrow) files from a UNICODE build using the system default code page (i.e. the code page for non-Unicode software). If you want to read and write Unicode files from a UNICODE build, you must open the files in binary mode and handle the BOM and line-ending conversions manually. On input, skip the BOM (if any), convert the external encoding to Windows Unicode (i.e. UTF-16LE), and convert any CR+LF sequences to LF. On output, write the BOM (if any), convert from UTF-16LE to whatever target encoding you want, and convert LF to CR+LF so that the result is a properly formatted PC text file.
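As a rough illustration of the manual handling for Unicode files, here is a minimal sketch for the UTF-16LE case only. The function names `decode_utf16le_text` and `encode_utf16le_text` are hypothetical helpers of mine, not part of any library. The sketch assumes the file contents were already read into (or will be written from) a byte buffer via binary-mode I/O, and that the in-memory text uses bare LF line endings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode a UTF-16LE byte buffer (as read from a file
// opened in binary mode) into UTF-16 code units, skipping a leading BOM
// and collapsing each CR+LF pair into a single LF.
std::u16string decode_utf16le_text(const std::string& bytes)
{
    std::u16string out;
    std::size_t i = 0;

    // Skip the UTF-16LE byte order mark (FF FE) if present.
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE) {
        i = 2;
    }

    // Assemble little-endian code units; a trailing odd byte is ignored.
    for (; i + 1 < bytes.size(); i += 2) {
        const char16_t unit = static_cast<char16_t>(
            static_cast<unsigned char>(bytes[i]) |
            (static_cast<unsigned char>(bytes[i + 1]) << 8));
        if (unit == u'\n' && !out.empty() && out.back() == u'\r') {
            out.back() = u'\n';   // overwrite the stored CR: CR+LF becomes LF
        } else {
            out.push_back(unit);
        }
    }
    return out;
}

// Hypothetical inverse: emit a BOM, expand LF to CR+LF, and serialize the
// code units as UTF-16LE bytes. Assumes the input uses bare LF line endings.
std::string encode_utf16le_text(const std::u16string& text)
{
    std::string out("\xFF\xFE", 2);
    for (char16_t unit : text) {
        if (unit == u'\n') {
            out.append("\r\0", 2);  // CR as a little-endian code unit
        }
        out.push_back(static_cast<char>(unit & 0xFF));
        out.push_back(static_cast<char>((unit >> 8) & 0xFF));
    }
    return out;
}
```

Other encodings (UTF-8, UTF-16BE) would need their own BOM check and decoding step; this sketch deliberately does not attempt them.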
BEWARE of the MS standard C library's puts, gets, fwrite and so on: on a file opened in text/translated mode they convert every 0x0A to a 0x0D 0x0A sequence on write, and vice versa on read, regardless of whether you're reading or writing a single byte, a wide character, or a stream of random binary data. The library doesn't care; all of these functions boil down to blind byte conversions in text/translated mode!
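To sidestep the text-mode translation described above, the usual approach is to open files with the "b" flag. The following is a minimal sketch using only standard C I/O; `write_bytes_binary` and `read_bytes_binary` are hypothetical helper names, and error handling is kept deliberately thin.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: write a byte buffer exactly as given.
// "wb" requests binary mode, so no newline translation takes place.
bool write_bytes_binary(const char* path, const std::string& bytes)
{
    std::FILE* f = std::fopen(path, "wb");
    if (!f) {
        return false;
    }
    const std::size_t written = std::fwrite(bytes.data(), 1, bytes.size(), f);
    const bool closed = (std::fclose(f) == 0);
    return closed && written == bytes.size();
}

// Hypothetical helper: read a whole file back as raw bytes ("rb").
// Returns an empty string if the file cannot be opened.
std::string read_bytes_binary(const char* path)
{
    std::string out;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) {
        return out;
    }
    char buf[4096];
    std::size_t n = 0;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
        out.append(buf, n);
    }
    std::fclose(f);
    return out;
}
```

With these, a buffer containing arbitrary 0x0A and 0x0D bytes should come back byte-for-byte identical, which is not guaranteed when the same data goes through a text-mode stream on Windows.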
Also be aware that many Windows API functions use CP_ACP internally, with no external control over their behavior (e.g. WritePrivateProfileString()). That is why one might want to ensure that all libraries operate with the same character locale, CP_ACP and not some other one: since you can't control the behavior of some of these functions, you're forced to conform to their choice or not use them at all.
If using MFC, one needs to:
// force CP_ACP *not* CP_THREAD_ACP for MFC CString auto-converters!!!
// this makes MFC's CString and CStdioFile and other interfaces use the
// system default code page, instead of the thread default code page (which is normally "c")
#define _CONVERSION_DONT_USE_THREAD_LOCALE
For C++ and C libraries, one must tell the libraries to use the system code page:
// force C++ and C libraries based on setlocale() to use system locale for narrow strings
// (this automatically calls setlocale() which makes the C library do the same thing as C++ std lib)
// we only change the LC_CTYPE, not collation or date/time formatting
std::locale::global(std::locale(str(boost::format(".%||") % GetACP()).c_str(), LC_CTYPE));
I put the #define in all of my precompiled headers, before including any other headers. I set the global locale in main (or its moral equivalent), once for the entire program (you may need to do this for every thread that is going to do I/O or string conversions).
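If you would rather not pull in Boost just to build the locale name, the same ".<codepage>" string can be produced with the standard library alone. This is only a sketch; `locale_name_for_code_page` is a hypothetical helper name, and the code page number is passed in as a parameter so the snippet stays independent of the Windows headers.

```cpp
#include <string>

// Hypothetical helper: build the ".<codepage>" locale name accepted by the
// Microsoft C runtime, e.g. ".1252" for code page 1252.
std::string locale_name_for_code_page(unsigned int code_page)
{
    return "." + std::to_string(code_page);
}
```

On Windows you would pass GetACP() as the argument and hand the resulting string's c_str() to the std::locale constructor in place of the boost::format expression.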
The build target is UNICODE, and for most of our I/O we convert strings explicitly before output, via CStringA(my_wide_string).
One other thing to be aware of: there are two different sets of multibyte functions in the C standard library under VS C++. One set uses the thread's locale for its operations; the other uses the multibyte code page set with _setmbcp() (which you can query via _getmbcp()). This is the actual code page (not a locale) that is used for all narrow string interpretation. (NOTE: it is always initialized to CP_ACP, i.e. GetACP(), by the VS C++ startup code.)
Useful reference materials:
- the-secret-family-split-in-windows-code-page-functions
- Sorting it all out (explains that there are four different locales in effect in Windows)
- MS offers some functions that allow you to set the encoding to use directly, but I didn't explore them
- An important note about a change to MFC that caused it to no longer respect CP_ACP, but rather CP_THREAD_ACP by default starting in MFC 7.0
- Exploration of why console apps in Windows are extreme FAIL when it comes to Unicode I/O
- MFC/ATL narrow/wide string conversion macros (which I don't use, but you may find useful)
- Byte order mark (BOM), which you need to write out for Unicode files of any encoding to be understood by other Windows software