Question

I'm parsing a file that among other things contains various strings in different encodings. The way these strings are stored is this:

0xFF 0xFF - block header                   2 bytes
0xXX 0xXX - length in bytes                2 bytes
0xXX      - encoding (can be 0, 1, 2, 3)   1 byte
...       - actual string                  num bytes per length

This is generally quite easy; however, I'm not sure how to deal with the encodings. The encoding can be one of:

0x00 - regular ascii string (that is, actual bytes represent char*)
0x01 - utf-16 with BOM (wchar_t* with the first two bytes being 0xFF 0xFE or 0xFE 0xFF)
0x02 - utf-16 without BOM (wchar_t* directly)
0x03 - utf-8 encoded string (char* to utf-8 strings)

I need to read and store this somehow. Initially I was thinking of a plain string, but that wouldn't work with wchar_t*. Then I thought about converting everything to wstring, yet this would mean quite a bit of unnecessary conversion. The next thing that came to mind was boost::variant<string, wstring> (I'm already using boost::variant in another place in the code). This seems to me to be a reasonable choice. So now I'm a bit stuck with parsing it. I'm thinking of something along these lines:

//after reading the bytes, I have these:
int length;
char encoding;
char* bytes;

boost::variant<string, wstring> value;
switch(encoding) {
    case 0x00:
    case 0x03:
        value = string(bytes, length);
        break;
    case 0x01:
        value = wstring(??);
        //how do I use BOM in creating the wstring?
        break;
    case 0x02:
        value = wstring(bytes, length >> 1);
        break;
    default:
        throw ERROR_INVALID_STRING_ENCODING;
}

As I do little more than print these strings later, I can store UTF8 in a simple string without too much bother.

The two questions I have are:

  1. Is such an approach a reasonable one (i.e. using boost::variant)?

  2. How do I create a wstring with a specific BOM?


Solution 2

After some research and trial and error, I decided to go with UTF8-CPP, a lightweight, header-only set of functions for converting to and from UTF-8. It includes functions for converting from UTF-16 to UTF-8 and, from my understanding, can deal correctly with BOM.

Then I store all strings as std::string, converting utf-16 strings to utf-8, something like this (from my example above):

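// Needs <cstdint>, <cstring>, <iterator>, <string>, <vector> and UTF8-CPP's utf8.h;
// assumes "using namespace std;" and "using namespace utf8;", as the unqualified names imply.
int length; char encoding; char* bytes;

string value;
switch(encoding) {
    case 0x00:
    case 0x03:
        value = string(bytes, length);
        break;
    case 0x01:
    case 0x02: {
        // Braces give the local below its own scope, so "default:" does not jump past its initialization.
        // Copy into 16-bit code units: wchar_t is 32 bits on many platforms, and bytes may be unaligned.
        // Note: this treats the data as native byte order and does not strip a BOM.
        vector<uint16_t> utf16(length >> 1);
        memcpy(utf16.data(), bytes, utf16.size() * sizeof(uint16_t));
        utf16to8(utf16.begin(), utf16.end(), back_inserter(value));
        break;
    }
    default:
        throw ERROR_INVALID_STRING_ENCODING;
}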

This works fine in my quick test. I'll need to do more testing before passing final judgement.

OTHER TIPS

UTF-16 comes in two byte orders, little-endian (LE) and big-endian (BE), and you need to know which one you are dealing with.

I suspect that 0x02 (UTF-16 without BOM, "wchar_t* directly") is actually UTF-16 BE. For the encoding with BOM, the BOM itself indicates whether the data is LE or BE.
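
To make the BOM point concrete, here is a minimal sketch that builds 16-bit code units from the raw bytes one pair at a time, so the result depends neither on the host's byte order nor on the size of wchar_t. The helper name decode_utf16_bytes is hypothetical, and falling back to big-endian when there is no BOM is only an assumption based on the suspicion above; check the file format's specification for the real default.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turns raw UTF-16 bytes into native 16-bit code units.
// If has_bom is true and the data starts with a BOM, the BOM selects the byte
// order and is skipped. Otherwise the bytes are read as big-endian (assumption).
// A trailing odd byte, if any, is ignored.
std::u16string decode_utf16_bytes(const unsigned char* data, std::size_t len, bool has_bom) {
    bool big_endian = true;
    std::size_t pos = 0;
    if (has_bom && len >= 2) {
        if (data[0] == 0xFF && data[1] == 0xFE) {
            big_endian = false;  // FF FE: little-endian
            pos = 2;
        } else if (data[0] == 0xFE && data[1] == 0xFF) {
            big_endian = true;   // FE FF: big-endian
            pos = 2;
        }
    }
    std::u16string out;
    out.reserve((len - pos) / 2);
    for (; pos + 1 < len; pos += 2) {
        unsigned hi = big_endian ? data[pos] : data[pos + 1];
        unsigned lo = big_endian ? data[pos + 1] : data[pos];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

For encoding 0x01 you would call it with has_bom set to true, and for 0x02 with has_bom set to false.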

Unicode support in the C++ Standard Library is very limited, and I don't think vanilla C++ will handle UTF-16 LE/BE properly, let alone UTF-8. Many Unicode applications use third-party support libraries such as ICU.

For the in-memory representation, I would stick to std::string, because it can hold text in any encoding, whereas std::wstring is not much help in this multiple-encoding situation. If you need to use std::wstring and the related std::iostream functions, be careful with the system locale and std::locale settings.

Mac OS X uses UTF-8 as its only default text encoding, whereas Windows uses UTF-16 LE. I think all you need is a single text encoding internally, plus a few conversion functions.
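
As an illustration of the "one internal encoding plus a few converters" idea, here is a rough sketch of a UTF-16 to UTF-8 converter that uses only the standard library, with UTF-8 in a std::string as the internal format. The name utf16_to_utf8 is hypothetical, and replacing unpaired surrogates with U+FFFD instead of reporting an error is a design assumption; a library such as ICU or UTF8-CPP is the safer choice for production code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: converts native-order UTF-16 code units to UTF-8.
// Unpaired surrogates are replaced with U+FFFD rather than rejected.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // Low surrogate without a preceding high surrogate.
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Strings stored with encoding 0x00 or 0x03 can then be kept as they are, and only the UTF-16 ones go through the converter.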

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow