Question

This is a follow-up question to a previous one.

The issue in that question has been resolved, and the code now proceeds as expected; however, the final output of the UTF-8 to UCS-2 conversion is gibberish. By that I mean the hex values of the final text don't correspond in any way to the UTF-8 version. I know they are different encodings, but there doesn't seem to be any mapping between the two.

The input to the conversion is "ĩ" and the output is "ÿþ)^A". In hex, the values are c4a9 for "ĩ" (the UTF-8 value) and "00FF 00FE 0029 0001" for "ÿþ)^A" (the UCS-2 values).

I'm hoping someone has an explanation for this behavior or can tell me what I've done incorrectly in the code.

The new updated code is:

UErrorCode resultCode = U_ZERO_ERROR;

UConverter* pLatinOneConv = ucnv_open("ISO-8859-1", &resultCode);

// Change the callback to error out instead of the default            
const void* oldContext;
UConverterFromUCallback oldFromAction;
UConverterToUCallback oldToAction;
ucnv_setFromUCallBack(pLatinOneConv, UCNV_FROM_U_CALLBACK_STOP, NULL, &oldFromAction, &oldContext, &resultCode);
ucnv_setToUCallBack(pLatinOneConv, UCNV_TO_U_CALLBACK_STOP, NULL, &oldToAction, &oldContext, &resultCode);

int32_t outputLength = 0;
int bodySize = uniString.length();
int targetSize = bodySize * 4;
char* target = new char[targetSize];                       

printf("Body: %s\n", uniString.c_str());
if (U_SUCCESS(resultCode))
{
    outputLength = ucnv_fromAlgorithmic(pLatinOneConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
        uniString.length(), &resultCode);
    ucnv_close(pLatinOneConv);
}
printf("ISO-8859-1 just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
    outputLength ? target : "invalid_char", resultCode, outputLength);

if (resultCode == U_INVALID_CHAR_FOUND || resultCode == U_ILLEGAL_CHAR_FOUND || resultCode == U_TRUNCATED_CHAR_FOUND)
{
    if (resultCode == U_INVALID_CHAR_FOUND)
    {
        resultCode = U_ZERO_ERROR;
        printf("Unmapped input character, cannot be converted to Latin1");                    
        // segment Text, if necessary, and add UUIDs copy existing pPdu's addresses and optionals
        UConverter* pUscTwoConv = ucnv_open("UCS-2", &resultCode);
        if (U_SUCCESS(resultCode))
        {
            printf("Text Body: %s\n", uniString.c_str());
            outputLength = ucnv_fromAlgorithmic(pUscTwoConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                uniString.length(), &resultCode);
            ucnv_close(pUscTwoConv);
        }
        printf("UCS-2 just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
            outputLength ? target : "invalid_char", resultCode, outputLength);

        if (U_SUCCESS(resultCode))
        {
            pdus = SegmentText(target, pPdu, SEGMENT_SIZE_UNICODE_MAX, true);
        }
    }
    else
    {
        printf("DecodeText(): Text contents does not appear to be valid UTF-8");
    }
}
else
{
    printf("DecodeText(): Text successfully converted to Latin1");
    std::string newBody(target, outputLength);
    pdus = SegmentText(newBody, pPdu, SEGMENT_SIZE_MAX);
}

Solution

The ICU conversion gives you correct results, but you don't quite know what to do with them, and you successfully convert them into gibberish. Here are the things you are doing wrong, more or less in order.

One

You print non-Latin-1 data on a system that is (as available evidence suggests) natively working in Latin-1.

This is not so bad when you print UTF-8, because UTF-8 is designed not to break things that work with 8-bit character data too badly. You will see gibberish, but at least you will see all of your data and will be able to convert it back to something sensible. (For instance, the UTF-8 encoding of your test character is C4 A9: no zero bytes, so byte-oriented string functions pass it through untouched; a Latin-1 terminal merely displays it as "Ä©".)

UTF-16 (which, by the way, superseded UCS-2 back in 1996) is not so kind. A UTF-16 encoded string consists of code units that are two bytes long, and either of those two bytes can be zero. (Every ASCII character encoded as UTF-16 has a zero byte.) As long as the other byte is non-zero, the character as a whole is not NUL. Your printf, strlen and so on, however, have no idea the other byte exists. They think you are feeding them Latin-1, and they stop at the first zero byte, which they interpret as the NUL terminator.
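
To see the effect in isolation, here is a minimal sketch; the buffer below is the UTF-16LE encoding of "Hi", written out by hand for illustration:

 #include <stdio.h>
 #include <string.h>

 int main (void) {
   // "Hi" in UTF-16LE: each code unit is two bytes, low byte first.
   const char utf16le[] = { 'H', '\0', 'i', '\0' };

   // strlen stops at the first zero byte: it reports 1, not 4.
   printf ("strlen sees %zu byte(s)\n", strlen (utf16le));

   // %s stops there too and prints just "H".
   printf ("%s\n", utf16le);
   return 0;
 }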

Luckily for you, the ĩ character doesn't have a zero byte in its UTF-16 encoding, so you have got away with it this time.

How to do it correctly? Never printf or fputs; use fwrite or std::ostream::write instead. Never strcpy; always memcpy. Never strlen; always keep the length handy in a separate variable.
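
As a sketch, a write helper that never relies on a terminator might look like this (the names are mine, not from any API):

 #include <stdio.h>
 #include <stdint.h>

 // Write exactly 'len' bytes; embedded zero bytes are not a problem.
 // 'data' and 'len' would be your 'target' and 'outputLength'.
 void write_bytes (FILE* fp, const char* data, int32_t len) {
   fwrite (data, 1, (size_t) len, fp);
 }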

Two

You print this data on screen.

Your screen can interpret bytes from (presumably) 0 to 31, and often the bytes that follow them, in different and interesting ways: moving your cursor, for example, or beeping, or changing text colours. You are printing UTF-16 data that can contain absolutely any byte values, even if the source contained perfectly ordinary printable Unicode characters. So just about anything can happen.

Luckily again, the single character you have tried to convert doesn't contain harmful bytes in its UTF-16 representation.

How to do it correctly? If you need to print something to take a quick look, print hexadecimal codes for either all or just non-printable characters.

 #include <stdio.h>
 #include <ctype.h>

 void print_bytes (FILE* fp, const unsigned char* s, int len,
                    bool escape_all) {
   // note: explicit length, *never* strlen!
   // note: unsigned char, you need it
   int i;
   for (i = 0; i < len; ++i, ++s)
   {
      if (escape_all || ! isprint(*s)) {
        fprintf (fp, "\\x%02x", *s);
      }
      else {
        fputc(*s, fp);
      }
   }
 }
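
Used on the converter output from your snippet, a call might look like this (`target` and `outputLength` are the names from your code):

 print_bytes (stdout, (const unsigned char*) target, outputLength, false);
 fputc ('\n', stdout);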

Three

You look up the Latin-1 characters you've got from your screen on fileinfo, thereby interpreting them as if they were Unicode characters, and then take their 16-bit character codes (one 16-bit code per character) and interpret those as if they were bytes.

I don't have much to say about this. Just don't do it. You have a function that prints bytes in a readable hexadecimal representation; use it. Alternatively, use any of the many freely available programs that display, or even let you edit, such a representation.

Which is not to say you shouldn't use fileinfo, of course. Do it right, which basically means: know what your encoding is, and know how any given encoding of a character differs from (though is sometimes similar to) its Unicode code point.
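
For your test character, the distinction looks like this (values you can check in any Unicode reference):

   Character:   ĩ (LATIN SMALL LETTER I WITH TILDE)
   Code point:  U+0129
   UTF-8:       C4 A9   (two bytes)
   UTF-16LE:    29 01   (one 16-bit code unit, low byte first)
   UTF-16BE:    01 29

The FF FE 29 01 you captured is therefore a little-endian BOM followed by exactly this character.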

Four

This paragraph is not about a mistake per se, but rather about developer's intuition (or the lack thereof) that does not correspond to any code you have posted.

Despite all of the above mistakes, you have managed to get data which is almost good. You have 00 in all the even places, which could mean that something is wrong with the size of the integers you are reading, and that you need to get rid of those zeros. After having done that, you are left with FFFE as the first two bytes, which you should have recognised as a BOM. You suspect you have an endianness issue, but you have not tried to resolve it by varying the UTF-16 flavour (UTF-16LE vs UTF-16BE).
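
If you want bytes with no BOM and a byte order you control, you could open a specific UTF-16 flavour instead of "UCS-2". A minimal sketch following the shape of your own code (`uniString`, `target` and `targetSize` are the names from your snippet):

 UErrorCode err = U_ZERO_ERROR;
 UConverter* conv = ucnv_open ("UTF-16LE", &err);   // or "UTF-16BE"
 if (U_SUCCESS (err))
 {
     // Source is UTF-8 (the algorithmic side); target is the opened
     // converter's charset. Unlike "UTF-16"/"UCS-2", the LE/BE
     // variants emit no BOM and have a fixed, known byte order.
     int32_t len = ucnv_fromAlgorithmic (conv, UCNV_UTF8,
         target, targetSize, uniString.c_str (), uniString.length (), &err);
     ucnv_close (conv);
     // For "ĩ" this yields the two bytes 29 01, with no FF FE in front.
     // Keep 'len' around; never take strlen of the result.
 }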

These are things any Unicode developer should be able to apply almost instinctively.


Unicode is big and complex, much more complex than most people realize. This is only the very beginning of the very beginning.



Licensed under: CC-BY-SA with attribution