Question

I ask this question in light of the additions that C++11 brings, namely char16_t/u16string.

I am writing an application that should have multilingual support. According to my plan, the localization strings will be stored in XML as UTF-16 and retrieved with pugixml. The strings will be used both for the GUI and for generating an HTML report of the computation results. Since I understood wchar_t/wstring to be deprecated in favour of the new u16string, I planned to use u16string for storing language strings inside the program. But since both pugixml and MFC's CString use wchar_t as the underlying storage type for Unicode, should I perhaps forget about u16string for now and simply use wstring instead?

Language-portability is crucial, platform portability doesn't matter.

I use Visual Studio 2013 with the Intel compiler.


Solution

The encoding used for storing the data outside the program is the only one that matters.

That data is likely to be used by other software. Someone will want to write those strings, and they'll probably use some kind of specialised editor or (gasp) a general-purpose text editor. UTF-8 has much better support in other software than UTF-16, which is why I would recommend it.

Inside the program, what encoding you use doesn't matter, as long as you do it consistently and don't mix them up in stupid ways.

Obviously, if you use the same encoding inside the program as you do outside of it, you don't need to perform any conversions and the risk of mixing them up and producing mojibake is not there.

The thing with pugixml using wchar_t is that the encoding it uses then depends on the size of wchar_t: if the size is 2, it uses UTF-16; if the size is 4, it uses UTF-32. Note that wchar_t mode is opt-in: pugixml only uses wchar_t when the PUGIXML_WCHAR_MODE macro is defined, and otherwise uses char with UTF-8, so you can leave the macro undefined and use that instead.
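To make the "size of wchar_t decides the encoding" point concrete, here is a minimal sketch using only the standard library. The helper names wide_encoding_name and wide_units_for_astral are made up for illustration and are not part of pugixml.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a pugixml function): names the Unicode encoding
// that the width of wchar_t implies on the current platform.
inline const char* wide_encoding_name() {
    return sizeof(wchar_t) == 2 ? "UTF-16"
         : sizeof(wchar_t) == 4 ? "UTF-32"
         : "unknown";
}

// Hypothetical helper: number of wchar_t units the compiler uses for one
// character outside the Basic Multilingual Plane (U+1F600).
// With UTF-16 this is a surrogate pair (2 units); with UTF-32 it is 1 unit.
inline std::size_t wide_units_for_astral() {
    return std::wstring(L"\U0001F600").size();
}
```

With Visual Studio on Windows wchar_t is 2 bytes, so the first function should report UTF-16 and the second should return 2; on most Linux toolchains the results are UTF-32 and 1.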

If you use the wchar_t API, stick to wstring. Remember: since we're inside the program, it doesn't matter whether it's going to be UTF-16 or UTF-32, as long as we're consistent. If you use the char API, stick to string. You could, I guess, convert from wchar_t to char16_t and use u16string, but that wouldn't give much benefit.
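For reference, the wchar_t to char16_t conversion mentioned above could look roughly like this. It is a sketch, not a recommendation: to_u16string is a hypothetical helper name, and it assumes the input wstring already holds well-formed UTF-16 (2-byte wchar_t) or UTF-32 (4-byte wchar_t), with no validation.

```cpp
#include <string>

// Hypothetical helper: converts a wstring to a u16string.
// Assumption: the input is well-formed; invalid code points are not checked.
inline std::u16string to_u16string(const std::wstring& in) {
    std::u16string out;
    out.reserve(in.size());
    for (wchar_t wc : in) {
        char32_t c = static_cast<char32_t>(wc);
        if (sizeof(wchar_t) == 2 || c < 0x10000) {
            // Already a UTF-16 code unit, or a BMP code point: copy as-is.
            out.push_back(static_cast<char16_t>(c));
        } else {
            // UTF-32 code point above the BMP: split into a surrogate pair.
            c -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
        }
    }
    return out;
}
```

When wchar_t is 2 bytes, as on Windows, this is just an element-by-element copy, which illustrates why the conversion buys so little there.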

The saving and loading functions in pugixml take an xml_encoding parameter that lets you pick what encoding will be on the data outside the program, and that doesn't have to match what you use internally. Pick whichever you find the most convenient.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow