Read/Store different types of strings (utf8/utf16/ansi)

Question 1

After some research and trial and error, I decided to go with UTF8-CPP, which is a lightweight, header-only set of functions for converting to/from UTF-8. It includes functions for converting from UTF-16 to UTF-8 and, from my understanding, can deal correctly with the BOM.

Then I store all strings as std::string, converting utf-16 strings to utf-8, something like this (from my example above):

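#include <cstdint>
#include <cstring>
#include <iterator>
#include <string>
#include <vector>
#include "utf8.h"  // UTF8-CPP

int length; char encoding; char* bytes;  // filled in by the reader

std::string value;
switch (encoding) {
    case 0x00:
    case 0x03:
        // Single-byte text or UTF-8: keep the bytes as they are.
        value = std::string(bytes, length);
        break;
    case 0x01:
    case 0x02: {
        // Braces are required: a declaration here must not be jumped over by default:.
        // Copy into 16-bit units; wchar_t is 32 bits wide on some platforms.
        std::vector<std::uint16_t> utf16(length / 2);
        std::memcpy(utf16.data(), bytes, utf16.size() * 2);
        // The units are interpreted in the machine's native byte order.
        std::vector<unsigned char> utf8;
        utf8::utf16to8(utf16.begin(), utf16.end(), std::back_inserter(utf8));
        value = std::string(utf8.begin(), utf8.end());
        break;
    }
    default:
        throw ERROR_INVALID_STRING_ENCODING;
}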

This works fine in my quick test. I'll need to do more testing before final judgement.

Question 2

UTF-16 data must be identified as either LE or BE.

I suspect that 0x02, UTF-16 without a BOM (read directly as wchar_t*), is actually UTF-16 BE. For the encoding with a BOM, the BOM indicates whether the data is LE or BE.
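
To make the BOM rule concrete, here is a minimal sketch using only the standard library. The names decode_utf16_units and ByteOrder are hypothetical, not part of any library. It assumes a BOM, when present, is the first two bytes (FF FE for LE, FE FF for BE), and that the caller supplies the byte order to use when there is no BOM.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class ByteOrder { Little, Big };

// Hypothetical helper: turns raw UTF-16 bytes into 16-bit code units.
// A leading BOM decides the byte order and is dropped; without a BOM,
// `fallback` is used. A trailing odd byte is ignored.
std::vector<std::uint16_t> decode_utf16_units(const unsigned char* bytes,
                                              std::size_t length,
                                              ByteOrder fallback) {
    ByteOrder order = fallback;
    std::size_t pos = 0;
    if (length >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            order = ByteOrder::Little;
            pos = 2;
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            order = ByteOrder::Big;
            pos = 2;
        }
    }
    std::vector<std::uint16_t> units;
    units.reserve((length - pos) / 2);
    for (; pos + 1 < length; pos += 2) {
        // Assemble each unit explicitly so the result does not depend on
        // the byte order of the machine running the code.
        std::uint16_t hi = (order == ByteOrder::Big) ? bytes[pos] : bytes[pos + 1];
        std::uint16_t lo = (order == ByteOrder::Big) ? bytes[pos + 1] : bytes[pos];
        units.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }
    return units;
}
```

If the suspicion about 0x02 is right, that case would call this with ByteOrder::Big as the fallback, and the resulting units could then be handed to a UTF-16 to UTF-8 converter.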

Unicode support in the C++ Standard Library is very limited, and I don't think vanilla C++ handles UTF-16 LE/BE properly, let alone UTF-8. Many Unicode-aware applications rely on third-party support libraries such as ICU.

For the in-memory representation, I would stick to std::string, because it can hold text in any encoding, whereas std::wstring is not much help when multiple encodings are involved. If you need std::wstring and the related iostream functions, be careful with the system locale and std::locale settings.

Mac OS X uses UTF-8 as its default text encoding, whereas Windows uses UTF-16 LE. I think a single internal text encoding plus a few conversion functions will serve your purpose.
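
As an illustration of what one such conversion function might look like, here is a rough standard-library-only sketch that turns UTF-16 code units into a UTF-8 std::string. The names append_utf8 and utf16_units_to_utf8 are made up for this example, and it assumes the units are already in the correct byte order. In practice a tested library such as UTF8-CPP or ICU is probably the safer choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Appends one Unicode code point to `out`, encoded as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts UTF-16 code units to UTF-8, combining surrogate pairs.
// Throws std::invalid_argument on an unpaired surrogate.
std::string utf16_units_to_utf8(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        std::uint32_t cp = units[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            if (i + 1 >= units.size() ||
                units[i + 1] < 0xDC00 || units[i + 1] > 0xDFFF) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With this shape, every string that enters the program is converted once at the boundary, and everything downstream only ever sees UTF-8 in a std::string.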