MultiByteToWideChar does not recognize some Korean characters

https://stackoverflow.com/questions/15900985

02-04-2022

Question

This Korean text (quoted-printable) "2013-03-22 =0E?@HD=0F 05:30" is not converted to Unicode properly by MultiByteToWideChar. The quoted-printable form is used only to show the text here; the actual content contains raw 0x0E and 0x0F bytes.

MultiByteToWideChar(50225, 0, bs.pData, bs.nSize, pData + nSize, nConvertedLen);

=0E?@HD=0F gets converted as-is, and the resulting Unicode contains the 0x0E and 0x0F ASCII control characters. However, I found that a couple of Korean characters should appear there instead. I had always thought that international character sequences start with a byte whose code is greater than 127, but I recently found that this is not true. MultiByteToWideChar, however, still behaves the way I used to think and refuses to treat 0x0E ? @ H D 0x0F as a couple of non-ASCII Korean characters of code page 50225 (or 949). When I do the same conversion on the same computer using .NET functions (like Encoding.GetEncoding(50225).GetString), I get the correct result and the Korean characters are there. But MultiByteToWideChar does not work. I tried the different flags you can set for MultiByteToWideChar (MB_COMPOSITE, etc.) but still had no luck.

How can I get MultiByteToWideChar to work properly? If it matters, I'm on WinXP SP3. Again, the .NET way works fine, and internally Encoding.GetString seems to call MultiByteToWideChar.

Solution

This is a known issue. The root cause is the inconsistent treatment of SHIFT OUT (0x0E) and SHIFT IN (0x0F) in code page 50225: in the conversion above they are not treated as encoding shifts and are passed through unchanged.

It's important to understand that these bytes are not characters themselves. Code page 50225 is not an ordinary multi-byte encoding like, e.g., UTF-8. UTF-8 is stateless; the same byte sequence always decodes to the same Unicode code points. The decoding of a byte sequence in 50225 depends on bytes consumed earlier, in particular 0x0E and 0x0F.
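To make the statefulness concrete, here is a minimal sketch of a workaround that is not part of the original answers. It assumes the input follows ISO-2022-KR conventions: 0x0E switches to double-byte Korean, 0x0F switches back to ASCII, and an optional `ESC $ ) C` header may be present. The helper name `Iso2022KrToEucKr` is hypothetical. It tracks the shift state and sets the high bit on shifted bytes, which should yield EUC-KR bytes that a stateless decoder (for example code page 949) can handle.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Windows API): rewrites ISO-2022-KR style input
// as EUC-KR bytes by tracking the shift state while scanning.
std::string Iso2022KrToEucKr(const std::string& in) {
    const unsigned char SO = 0x0E;  // Shift Out: switch to double-byte Korean
    const unsigned char SI = 0x0F;  // Shift In: switch back to ASCII
    const std::string designator = "\x1B$)C";  // optional ISO-2022-KR header

    std::string out;
    bool shifted = false;  // decoder state carried from one byte to the next
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        if (in.compare(i, designator.size(), designator) == 0) {
            i += designator.size() - 1;  // skip the header; it carries no text
        } else if (b == SO) {
            shifted = true;              // state change only, nothing emitted
        } else if (b == SI) {
            shifted = false;
        } else if (shifted && b >= 0x21 && b <= 0x7E) {
            out.push_back(static_cast<char>(b | 0x80));  // 7-bit -> EUC-KR byte
        } else {
            out.push_back(static_cast<char>(b));         // plain ASCII
        }
    }
    return out;
}
```

For the bytes in the question, `?@HD` (0x3F 0x40 0x48 0x44) between the two shift bytes becomes 0xBF 0xC0 0xC8 0xC4, which in EUC-KR reads 오후 ("PM"), while the same four bytes outside a shifted run stay plain ASCII. The sketch does not validate the input or reset the state at line ends, so treat it as an illustration rather than a complete decoder.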

The advice given makes a lot of sense. Use any sane Unicode encoding. (Personally, I'd advise UTF-8).

OTHER TIPS

Instead of MultiByteToWideChar, I suggest using IMultiLanguage::ConvertStringToUnicode, which Microsoft recommends and which decodes the characters properly. The only "downside" is that it requires Windows XP, whereas MultiByteToWideChar works on Windows 2000. Not a huge downside IMO.

IMultiLanguage also has other tools that make encoding conversion easier, such as IMultiLanguage::GetCharsetInfo and IMultiLanguage::EnumCodePages.

Licensed under: CC-BY-SA with attribution
