Question

My question seems to have confused folks. Here's something concrete:

Our code does the following:

USES_CONVERSION;  // required by the classic ATL W2T conversion macro
FILE * fout = _tfsopen(_T("丸穴種類.txt"), _T("w"), _SH_DENYNO);
_fputts(W2T(L"刃物種類\n"), fout);
fclose(fout);

Under the MBCS build target, the above produces a properly encoded file for code page 932 (assuming 932 was the system default code page when it was run).

Under the UNICODE build target, the above produces a garbage file full of ????.

I want to define a symbol, or use a compiler switch, or include a special header, or link to a given library, to make the above continue to work when the build target is UNICODE without changing the source code.

Here's the question as it used to exist:

FILE* streams can be opened in t(ranslated) or b(inary) modes. Desktop applications can be compiled for UNICODE or MBCS (under Windows).

If my application is compiled for MBCS, then writing MBCS strings to a "wt" stream results in a well-formed text file containing MBCS text for the system code page (i.e. the code page "for non-Unicode software").

Because our software generally uses the _t versions of most string & stream functions, in MBCS builds output is handled primarily by puts(pszMBString) or something similar (putc, etc.). Since pszMBString is already in the system code page (e.g. 932 when running on a Japanese machine), the string is written out verbatim (although line terminators are massaged automatically by puts and gets).

However, if my application is compiled for UNICODE, then writing MBCS strings to a "wt" stream results in garbage (lots of "?????" characters). Here, writing MBCS strings means I convert the UNICODE to the system's default code page and then write that to the stream using, for example, fwrite(pszNarrow, 1, length, stream).


I can open my streams in binary mode, in which case I'll get the correct MBCS text... but the line terminators will no longer be PC-style CR+LF; instead they will be UNIX-style LF only. This is because in binary (non-translated) mode, the file stream doesn't perform the LF->CR+LF translation.


But what I really need, is to be able to produce the exact same files I used to be able to produce when compiling for MBCS: correct line terminators and MBCS text files using the system's code page.

Obviously I can manually adjust the line terminators myself and use binary streams. However, this is a very invasive approach, as I now have to find every bit of code throughout the system that writes text files and alter it so that it does all of this correctly. What blows my mind is that the UNICODE target is stupider / less capable than the MBCS target we used to use! Surely there is a way to tell the C library to "output narrow strings as-is but handle line terminators properly, exactly as you would in MBCS builds"?!

Solution

Sadly, this is a huge topic that deserves a small book devoted to it. And that book would basically need a specialized chapter for every target platform one wished to build for (Linux, Windows [flavor], Mac, etc.).

My answer only covers Windows desktop applications, compiled as C++ with or without MFC. Please note: this pertains to reading and writing MBCS (narrow) files from a UNICODE build using the system default code page (i.e. the code page for non-Unicode software). If you want to read and write Unicode files from a UNICODE build, you must open the files in binary mode and handle BOM and line-feed conversions manually: on input, skip the BOM (if any), convert the external encoding to Windows Unicode (i.e. UTF-16LE), and convert any CR+LF sequences to LF only; on output, write the BOM (if any), convert from UTF-16LE to whatever target encoding you want, and convert LF to CR+LF sequences so the result is a properly formatted PC text file.
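As a concrete illustration of that output path, here is a minimal sketch (mine, not the original answer's code; the helper name and the choice of UTF-8 are assumptions for illustration) that writes a Unicode text file from a UNICODE build: binary mode, BOM and LF -> CR+LF handled by hand.

#include <windows.h>
#include <cstdio>
#include <string>

void WriteUtf8TextFile(const wchar_t* path, const std::wstring& text)  // helper name is mine
{
    // expand LF to CR+LF ourselves, since binary mode won't do it for us
    std::wstring crlf;
    for (wchar_t ch : text) {
        if (ch == L'\n') crlf += L'\r';
        crlf += ch;
    }

    // convert UTF-16LE to UTF-8
    int bytes = WideCharToMultiByte(CP_UTF8, 0, crlf.c_str(), (int)crlf.size(),
                                    nullptr, 0, nullptr, nullptr);
    std::string utf8((size_t)bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, crlf.c_str(), (int)crlf.size(),
                        &utf8[0], bytes, nullptr, nullptr);

    FILE* f = _wfopen(path, L"wb");    // binary: no hidden translation
    if (!f) return;
    fwrite("\xEF\xBB\xBF", 1, 3, f);   // UTF-8 BOM, so Windows software recognizes the encoding
    fwrite(utf8.data(), 1, utf8.size(), f);
    fclose(f);
}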

BEWARE of MS's std C library's puts and gets and fwrite and so on, which, if the stream was opened in text/translated mode, will convert any 0x0A to a 0x0D 0x0A sequence on write, and vice versa on read, regardless of whether you're reading or writing a single byte, or a wide character, or a stream of random binary data -- it doesn't care: all of these functions boil down to doing blind byte-conversions in text/translated mode!!!
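A tiny demonstration of that blindness (my sketch, not from the answer): even raw binary bytes pushed through a text-mode stream get rewritten on disk.

#include <cstdio>

int main()
{
    FILE* f = fopen("demo.bin", "wt");   // text/translated mode
    unsigned char raw[] = { 0x41, 0x0A, 0x42 };
    fwrite(raw, 1, sizeof raw, f);       // lands on disk as 41 0D 0A 42
    fclose(f);
}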

Also be aware that many of the Windows API functions use CP_ACP internally, with no external control over their behavior (e.g. WritePrivateProfileString()). Hence one might want to ensure that all libraries are operating with the same character locale: CP_ACP and not some other one. Since you can't control some of these functions' behavior, you're forced to conform to their choice or not use them at all.

If using MFC, one needs to:

// force CP_ACP *not* CP_THREAD_ACP for MFC CString auto-converters!!!
// this makes MFC's CString and CStdioFile and other interfaces use the
// system default code page, instead of the thread default code page (which is normally "C")
#define _CONVERSION_DONT_USE_THREAD_LOCALE

For C++ and C libraries, one must tell the libraries to use the system code page:

// force C++ and C libraries based on setlocale() to use system locale for narrow strings
// (this automatically calls setlocale() which makes the C library do the same thing as C++ std lib)
// we only change the LC_CTYPE, not collation or date/time formatting
std::locale::global(std::locale(str(boost::format(".%||") % GetACP()).c_str(), LC_CTYPE));

I do the #define in all of my precompiled headers, before including any other headers. I set the global locale in main (or its moral equivalent), once for the entire program (though you may need to call it from every thread that is going to do I/O or string conversions).
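If you don't use Boost, the C-library half of the above can be done with a plain setlocale call; a sketch under that assumption (the helper name is mine):

#include <windows.h>
#include <clocale>
#include <cstdio>

void UseSystemCodePageForNarrowIO()   // helper name is mine
{
    char name[16];
    std::snprintf(name, sizeof name, ".%u", GetACP());  // e.g. ".932" or ".1252"
    std::setlocale(LC_CTYPE, name);   // narrow<->wide CRT conversions now use CP_ACP
}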

The build target is UNICODE, and for most of our I/O, we use explicit string conversions before outputting via CStringA(my_wide_string).
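For instance, a minimal sketch of that conversion pattern (function and variable names are mine; CStringA converts using the code page selected by the #define above):

#include <atlstr.h>   // CStringW / CStringA
#include <cstdio>

void WriteLine(FILE* fout, const CStringW& wide)
{
    CStringA narrow(wide);             // W -> A, using the code page chosen by the #define
    fputs(narrow.GetString(), fout);   // narrow text written as-is (text mode still fixes "\n")
}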

One other thing to be aware of: there are two different sets of multibyte functions in the C standard library under VS C++ -- those which use the thread's locale for their operations, and another set (the _mbs* functions) which use the code page set by _setmbcp() (and queried via _getmbcp()). That is an actual code page (not a locale) used for all narrow-string interpretation by those functions (NOTE: it is always initialized to CP_ACP, i.e. GetACP(), by the VS C++ startup code).
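A sketch (mine) showing that this code page is separate from the C locale -- it starts out as GetACP() but can be changed independently:

#include <mbctype.h>
#include <cstdio>

int main()
{
    printf("_mbs* code page: %d\n", _getmbcp());   // initialized to GetACP() at startup
    _setmbcp(932);                                 // e.g. force Shift JIS for the _mbs* functions
    printf("_mbs* code page now: %d\n", _getmbcp());
}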

Useful reference materials:
- "The secret family split in Windows code page functions"
- "Sorting It All Out" (explains that there are four different locales in effect in Windows)
- MS offers some functions that allow you to set the encoding to use directly, but I didn't explore them
- An important note about a change in MFC 7.0 that caused it to no longer respect CP_ACP, but rather CP_THREAD_ACP, by default
- An exploration of why console apps in Windows are extreme FAIL when it comes to Unicode I/O
- The MFC/ATL narrow/wide string conversion macros (which I don't use, but you may find useful)
- The byte order mark, which you need to write out for Unicode files of any encoding to be understood by other Windows software

OTHER TIPS

The C library has support for both narrow (char) and wide (wchar_t) strings. In Windows these two types of strings are called MBCS (or ANSI) and Unicode, respectively.

It is fully possible to use the narrow functions even though you have defined _UNICODE. The following code should produce the same output whether or not _UNICODE is defined:

FILE* f = fopen("foo.txt", "wt");
fputs("foo\nbar\n", f);
fclose(f);

In your question you wrote: "I convert the UNICODE to the system's default code page and write that to the stream". This leads me to believe that your wide string contains characters that cannot be converted to the current code page, each of which is therefore replaced with a question mark.

Perhaps you could use some encoding other than the current code page. I recommend using the UTF-8 encoding wherever possible.

Update: Testing your example code on a Windows machine running code page 1252, the call to _fputts returns -1, indicating an error, and errno is set to EILSEQ, which means "Illegal byte sequence". The MSDN documentation for fopen states that:

When a Unicode stream-I/O function operates in text mode (the default), the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).

This is the key to the error: wctomb uses the C standard library's locale. By explicitly setting that locale to code page 932 (Shift JIS), the code ran perfectly and the output file was correctly encoded in Shift JIS.

#include <stdio.h>
#include <locale.h>   // setlocale
#include <share.h>    // _SH_DENYNO

int main()
{
   // point the C library's wide-to-narrow conversion at code page 932 (Shift JIS)
   setlocale(LC_ALL, ".932");
   FILE * fout = _wfsopen(L"丸穴種類.txt", L"w", _SH_DENYNO);
   fputws(L"刃物種類\n", fout);
   fclose(fout);
}

An alternative (and perhaps preferable) solution would be to handle the conversions yourself before calling the narrow-string functions of the C standard library, as sketched below.
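A sketch of that alternative (the helper name is mine), assuming the system code page can represent the text: convert with WideCharToMultiByte(CP_ACP, ...) and let a text-mode ("wt") stream handle the line endings.

#include <windows.h>
#include <cstdio>
#include <string>

void WriteAcpText(FILE* f, const std::wstring& w)   // f opened with "wt"
{
    int bytes = WideCharToMultiByte(CP_ACP, 0, w.c_str(), (int)w.size(),
                                    nullptr, 0, nullptr, nullptr);
    std::string narrow((size_t)bytes, '\0');
    WideCharToMultiByte(CP_ACP, 0, w.c_str(), (int)w.size(),
                        &narrow[0], bytes, nullptr, nullptr);
    fwrite(narrow.data(), 1, narrow.size(), f);     // text mode expands each 0x0A to CR+LF
}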

When you compile for UNICODE, the C++ library knows nothing about MBCS. If you open the file for text output, it will attempt to treat the buffers you pass to it as UNICODE buffers.

Also, MBCS is a variable-length encoding. To parse it, the C++ library would need to iterate over characters, which is of course impossible when it knows nothing about MBCS. Hence it's impossible to "just handle line terminators correctly".

I would suggest that you either prepare your strings beforehand, or write your own function for putting a string to a file. Whether writing characters one by one would be efficient is unclear (measurements required); if not, you can handle strings piecewise, writing everything that doesn't contain \n in one go, as in the sketch below.
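A sketch of that piecewise approach (the helper name is mine): write an already-converted MBCS buffer to a binary-mode FILE*, expanding each '\n' to CR+LF but flushing the spans between newlines in single fwrite calls. Byte-scanning for 0x0A is safe for Windows DBCS code pages such as 932, where trail bytes are always >= 0x40.

#include <cstdio>
#include <cstring>

void WriteMbcsText(FILE* f, const char* s, size_t len)   // f opened with "wb"
{
    const char* p = s;
    const char* end = s + len;
    while (p < end) {
        const char* nl = (const char*)memchr(p, '\n', (size_t)(end - p));
        size_t chunk = nl ? (size_t)(nl - p) : (size_t)(end - p);
        if (chunk) fwrite(p, 1, chunk, f);   // everything up to the newline, in one go
        if (!nl) break;
        fwrite("\r\n", 1, 2, f);             // manual CR+LF
        p = nl + 1;
    }
}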

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow