Question

I have a multi-byte string containing a mixture of Japanese and Latin characters. I'm trying to copy parts of this string to a separate memory location. Since it's a multi-byte string, some characters use one byte and others use two. When copying parts of the string, I must not copy "half" a Japanese character. To do this properly, I need to determine where characters start and end in the multi-byte string.

As an example, if the string contains 3 characters that require [2 bytes][2 bytes][1 byte], I must copy either 2, 4 or 5 bytes to the other location and not 3, since copying 3 bytes would copy only half of the second character.

To figure out where characters start and end in the multi-byte string, I'm trying to use the Windows API functions CharNext and CharNextExA, but without luck. When I use these functions, they navigate through my string one byte at a time rather than one character at a time. According to MSDN, "The CharNext function retrieves a pointer to the next character in a string."

Here's some code to illustrate this problem:

#include <windows.h>
#include <stdio.h>
#include <wchar.h>
#include <string.h>

/* string consisting of six "asian" characters */
wchar_t wcsString[] = L"\u9580\u961c\u9640\u963f\u963b\u9644";

int main() 
{
   // Convert the asian string from wide char to multi-byte.
   LPSTR mbString = new char[1000];
   WideCharToMultiByte( CP_UTF8, 0, wcsString, -1, mbString, 100,  NULL, NULL);

   // Count the number of characters in the string.
   int characterCount = 0;
   LPSTR currentCharacter = mbString;
   while (*currentCharacter)
   {
      characterCount++;

      currentCharacter = CharNextExA(CP_UTF8, currentCharacter, 0);
   }
}

(please ignore memory leak and failure to do error checking.)

Now, in the example above I would expect characterCount to become 6, since that's the number of characters in the Asian string. Instead, characterCount becomes 18, because mbString contains 18 characters:

é–€é˜œé™€é˜¿é˜»é™„

I don't understand how it's supposed to work. How is CharNext supposed to know whether "é–€é" in the string is an encoded version of a Japanese character, or in fact the characters é – € and é?

Some notes:

  • I've read Joel's blog post about what every developer needs to know about Unicode. I may have misunderstood something in it though.
  • If all I wanted to do was to count the characters, I could count the characters in the asian string directly. Keep in mind that my real goal is copying parts of the multi-byte string to a separate location. The separate location only supports multi-byte, not widechar.
  • If I convert the content of mbString back to wide char using MultiByteToWideChar, I get the correct string (門阜陀阿阻附), which indicates that there's nothing wrong with mbString.

EDIT: Apparently the CharNext functions don't support UTF-8, but Microsoft forgot to document that. I threw/copy-pasted together my own routine, which I won't use and which needs improving. I'm guessing it's easy to crash.

  LPSTR CharMoveNext(LPSTR szString)
  {
     if (szString == 0 || *szString == 0)
        return 0;

     if ( (szString[0] & 0x80) == 0x00)
        return szString + 1;
     else if ( (szString[0] & 0xE0) == 0xC0)
        return szString + 2;
     else if ( (szString[0] & 0xF0) == 0xE0)
        return szString + 3;
     else if ( (szString[0] & 0xF8) == 0xF0)
        return szString + 4;
     else
        return szString +1;
  }

Solution

There is a really good explanation of what is going on at the Sorting it All Out blog: Is CharNextExA broken?. In short, CharNext is not designed to work with UTF-8 strings.

OTHER TIPS

As far as I can determine (from Google and experimentation), CharNextExA doesn't actually work with UTF-8. It only works with the supported multi-byte encodings that use single-byte characters or short lead/trail byte pairs.

UTF-8 is a fairly regular encoding. There are a lot of libraries that will do what you want, but it's also fairly easy to roll your own.

Have a look at unicode.org, particularly Table 3-7 (Well-Formed UTF-8 Byte Sequences), for the valid sequence forms.

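#include <cstddef>

// Returns a pointer to the next character, or NULL if the bytes at 'in'
// are not a well-formed UTF-8 sequence (returning NULL is one choice;
// throwing an error would work as well).
const char* NextUtf8( const char* in )
{
    if( in == NULL || *in == '\0' )
        return in;

    unsigned char uc = static_cast<unsigned char>(*in);
    int trail;  // number of trail bytes that must follow the lead byte

    if( uc < 0x80 )
        return in + 1;          // single-byte (ASCII) character
    else if( uc < 0xc2 )
        return NULL;            // invalid lead byte
    else if( uc < 0xe0 )
        trail = 1;
    else if( uc < 0xf0 )
        trail = 2;
    else if( uc < 0xf5 )
        trail = 3;
    else
        return NULL;            // 0xF5 .. 0xFF never occur in UTF-8

    // Every trail byte must be in 0x80 .. 0xBF. The terminating '\0'
    // fails this check, so we never read past the end of the string.
    for( int i = 1; i <= trail; ++i )
    {
        unsigned char t = static_cast<unsigned char>(in[i]);
        if( t < 0x80 || t > 0xbf )
            return NULL;
    }

    // Table 3-7 narrows the range of in[1] after four lead bytes.
    unsigned char t1 = static_cast<unsigned char>(in[1]);
    if( ( uc == 0xe0 && t1 < 0xa0 ) ||   // overlong 3-byte form
        ( uc == 0xed && t1 > 0x9f ) ||   // UTF-16 surrogates
        ( uc == 0xf0 && t1 < 0x90 ) ||   // overlong 4-byte form
        ( uc == 0xf4 && t1 > 0x8f ) )    // beyond U+10FFFF
        return NULL;

    return in + 1 + trail;
}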
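To illustrate how the lead-byte ranges can drive a walk over the string, here is a minimal sketch that counts characters the way the question's loop intended. Utf8SequenceLength and CountUtf8Characters are hypothetical names, not part of any Windows API, and the sketch only checks that trail bytes look like 10xxxxxx rather than applying every rule from Table 3-7.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with this lead byte, or 0 if the byte cannot start a sequence.
int Utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;   // ASCII
    if (lead < 0xC2) return 0;   // trail byte, or overlong lead 0xC0/0xC1
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead < 0xF5) return 4;
    return 0;                    // 0xF5..0xFF are never used
}

// Hypothetical helper: counts the characters (code points) in a non-null,
// NUL-terminated string. Returns -1 if a sequence is malformed or cut short.
long CountUtf8Characters(const char* s)
{
    long count = 0;
    while (*s != '\0')
    {
        int len = Utf8SequenceLength(static_cast<unsigned char>(*s));
        if (len == 0)
            return -1;
        // Each trail byte must be 10xxxxxx. The terminator fails this test,
        // so the loop stops before reading past the end of the string.
        for (int i = 1; i < len; ++i)
            if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
                return -1;
        s += len;
        ++count;
    }
    return count;
}
```

Called on the UTF-8 bytes of the six-character string from the question (18 bytes), CountUtf8Characters returns 6.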

Given that CharNextExA doesn't work with UTF-8, you can parse it yourself. Just skip over the bytes that have 10 in the top two bits, since those are continuation bytes. You can see the pattern in the definition of UTF-8: http://en.wikipedia.org/wiki/Utf-8

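LPSTR CharMoveNext(LPSTR szString)
{
    // Don't step past the terminator.
    if (szString == 0 || *szString == 0)
        return szString;

    // Step off the lead byte, then skip continuation bytes (10xxxxxx).
    ++szString;
    while ((*szString & 0xc0) == 0x80)
        ++szString;
    return szString;
}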
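The same bit test can also answer the original copying question: how many bytes may be copied without cutting a character in half. The following is a minimal sketch under the assumption that the input is valid, NUL-terminated UTF-8; Utf8SafePrefixLength is a hypothetical name chosen for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: largest n <= maxBytes such that the first n bytes
// of s end on a character boundary. Assumes s is valid UTF-8.
std::size_t Utf8SafePrefixLength(const char* s, std::size_t maxBytes)
{
    std::size_t len = std::strlen(s);
    if (len <= maxBytes)
        return len;  // the whole string fits

    std::size_t n = maxBytes;
    // s[n] is the first byte that would be left out. If it is a
    // continuation byte (10xxxxxx), the cut falls inside a character,
    // so back up until the cut lands just before a lead byte.
    while (n > 0 && (static_cast<unsigned char>(s[n]) & 0xC0) == 0x80)
        --n;
    return n;
}
```

The returned length can be passed to memcpy, followed by writing a terminating '\0'. Note that in UTF-8 the Japanese characters from the question take three bytes each, not two, so a string laid out as [3 bytes][3 bytes][1 byte] can be cut at 0, 3, 6 or 7 bytes.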

This isn't a direct answer to your question, but you may find the following tutorial helpful, I certainly did. In fact the information provided here is enough that you should be able to traverse the multi-byte string yourself with ease:

Complete String Tutorial

Try using 932 (Japanese Shift-JIS) for the code page. I don't think CP_UTF8 is a real code page, and it may only work for WideCharToMultiByte() and back. You can also try isleadbyte(), but that requires either setting the locale correctly or setting the default code page correctly. I have successfully used IsDBCSLeadByteEx(), but never with CP_UTF8.
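If the data really is in code page 932, the lead-byte approach might look like the sketch below. To keep it portable, it hard-codes the code page 932 lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) as an assumption. On Windows, a call to IsDBCSLeadByteEx(932, b) would take the place of the hypothetical IsCp932LeadByte. Cp932SafePrefixLength is likewise a made-up name.

```cpp
#include <cstddef>

// Assumption: lead-byte ranges for code page 932, hard-coded for portability.
// On Windows, IsDBCSLeadByteEx(932, b) would be used instead.
bool IsCp932LeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: largest n <= maxBytes such that copying n bytes of
// the NUL-terminated string s does not split a double-byte character.
std::size_t Cp932SafePrefixLength(const char* s, std::size_t maxBytes)
{
    std::size_t n = 0;
    while (s[n] != '\0')
    {
        // A lead byte followed by another byte forms one 2-byte character.
        std::size_t step = 1;
        if (IsCp932LeadByte(static_cast<unsigned char>(s[n])) && s[n + 1] != '\0')
            step = 2;
        if (n + step > maxBytes)
            break;  // the next character would not fit completely
        n += step;
    }
    return n;
}
```

The scan runs forward from the start because trail bytes in a double-byte code page can share values with lead bytes, so a boundary cannot be found reliably by looking backward from the cut point. For a string laid out as [2 bytes][2 bytes][1 byte], the sketch yields 2 when asked for at most 3 bytes, matching the example in the question.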

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow