Unicode <-> Multibyte conversion (native vs. managed)

https://stackoverflow.com/questions/10722193

10-06-2021

Question

I'm trying to convert unicode strings coming from .NET to native C++ so that I can write them to a text file. The process shall then be reversed, so that the text from the file is read and converted to a managed unicode string.

I use the following code:

String^ FromNativeToDotNet(std::string value)
{
  // Convert an ASCII string to a Unicode String
  std::wstring wstrTo;
  wchar_t *wszTo = new wchar_t[value.length() + 1];
  wszTo[value.size()] = L'\0';
  MultiByteToWideChar(CP_UTF8, 0, value.c_str(), -1, wszTo, (int)value.length());
  wstrTo = wszTo;
  delete[] wszTo;

  return gcnew String(wstrTo.c_str());
}


std::string FromDotNetToNative(String^ value)
{ 
  // Pass on changes to native part
  pin_ptr<const wchar_t> wcValue = SafePtrToStringChars(value);
  std::wstring wsValue( wcValue );

  // Convert a Unicode string to an ASCII string
  std::string strTo;
  char *szTo = new char[wsValue.length() + 1];
  szTo[wsValue.size()] = '\0';
  WideCharToMultiByte(CP_UTF8, 0, wsValue.c_str(), -1, szTo, (int)wsValue.length(), NULL, NULL);
  strTo = szTo;
  delete[] szTo;

  return strTo;
}

What happens is that e.g. a Japanese character gets converted to two ASCII chars (漢 -> "w). I assume that's correct? But the other way does not work: when I call FromNativeToDotNet with "w, I only get "w back as a managed unicode string... How can I get the Japanese character correctly restored?

Solution

Try this instead:

String^ FromNativeToDotNet(std::string value)
{
  // Convert a UTF-8 string to a UTF-16 String
  int len = MultiByteToWideChar(CP_UTF8, 0, value.c_str(), value.length(), NULL, 0);
  if (len > 0)
  {
    std::vector<wchar_t> wszTo(len);
    MultiByteToWideChar(CP_UTF8, 0, value.c_str(), value.length(), &wszTo[0], len);
    return gcnew String(&wszTo[0], 0, len);
  }

  return gcnew String((wchar_t*)NULL);
}

std::string FromDotNetToNative(String^ value)
{ 
  // Pass on changes to native part
  pin_ptr<const wchar_t> wcValue = SafePtrToStringChars(value);

  // Convert a UTF-16 string to a UTF-8 string
  int len = WideCharToMultiByte(CP_UTF8, 0, wcValue, value->Length, NULL, 0, NULL, NULL);
  if (len > 0)
  {
    std::vector<char> szTo(len);
    WideCharToMultiByte(CP_UTF8, 0, wcValue, value->Length, &szTo[0], len, NULL, NULL);
    return std::string(&szTo[0], len);
  }

  return std::string();
}

OTHER TIPS

Best to use UTF8Encoding:

static String^ FromNativeToDotNet(std::string value)
{
    array<Byte>^ bytes = gcnew array<Byte>(value.length());
    System::Runtime::InteropServices::Marshal::Copy(IntPtr((void*)value.c_str()), bytes, 0, value.length());
    return (gcnew System::Text::UTF8Encoding)->GetString(bytes);
}


static std::string FromDotNetToNative(String^ value)
{ 
    if (value->Length == 0) return std::string("");
    array<Byte>^ bytes = (gcnew System::Text::UTF8Encoding)->GetBytes(value);
    pin_ptr<Byte> chars = &bytes[0];
    return std::string((char*)chars, bytes->Length);
}

a Japanese character gets converted to two ASCII chars (漢 -> "w). I assume that's correct?

No, that character, U+6F22, should be converted to three bytes: 0xE6 0xBC 0xA2

In UTF-16 (little endian) U+6F22 is stored in memory as 0x22 0x6F, which would look like "o in ASCII (rather than "w), so it looks like something is wrong with your conversion from String^ to std::string.
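To check the byte values quoted above without any Windows API, here is a minimal portable sketch. EncodeUtf8 and Utf16LeBytes are hypothetical helper names made up for this illustration (they are not part of the Win32 or .NET APIs), and the encoder assumes a valid, non-surrogate code point.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (not a surrogate, at most U+10FFFF).
std::string EncodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: in-memory bytes of one UTF-16 code unit, little endian.
std::string Utf16LeBytes(char16_t unit)
{
    std::string out;
    out += static_cast<char>(unit & 0xFF);  // low byte first
    out += static_cast<char>(unit >> 8);    // then high byte
    return out;
}
```

Calling EncodeUtf8(0x6F22) yields the three bytes 0xE6 0xBC 0xA2, while Utf16LeBytes(0x6F22) yields 0x22 0x6F, which is the two-character sequence "o when viewed as ASCII.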

I'm not familiar enough with String^ to know the right way to convert from String^ to std::wstring, but I'm pretty sure that's where your problem is.

I don't think the following has anything to do with your problem, but it is obviously wrong:

std::string strTo;
char *szTo = new char[wsValue.length() + 1];

You already know a single wide character can produce multiple narrow characters, so a buffer sized by the number of wide characters is not guaranteed to be large enough to hold the corresponding narrow characters.

You need to call WideCharToMultiByte once to calculate the required buffer size, and then call it again with a buffer of that size. Alternatively, you can allocate a buffer of three times the number of wide chars, which is the worst case for UTF-8.
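The sizing rule can be illustrated with a portable sketch that counts the UTF-8 bytes needed for a UTF-16 string, which is roughly what the first, size-only call computes. Utf8LengthOfUtf16 is a hypothetical name used only here, and counting an unpaired surrogate as three bytes (the size of a replacement character) is an assumption of this sketch, not a statement about what WideCharToMultiByte does on every Windows version.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-8 bytes needed to encode a UTF-16 string.
// Assumption: an unpaired surrogate is counted as 3 bytes (like U+FFFD).
std::size_t Utf8LengthOfUtf16(const std::u16string& s)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u < 0x80) {
            bytes += 1;                       // ASCII range
        } else if (u < 0x800) {
            bytes += 2;
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size() &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;                       // surrogate pair: two units, four bytes
            ++i;                              // skip the low surrogate
        } else {
            bytes += 3;                       // rest of the BMP (or a lone surrogate)
        }
    }
    // No UTF-16 code unit ever needs more than three UTF-8 bytes.
    assert(bytes <= 3 * s.size());
    return bytes;
}
```

A std::string sized with this count (or with 3 * s.size() as the upper bound) is large enough for the converted text; a buffer of s.size() + 1 chars, as in the question, is not.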

Licensed under: CC-BY-SA with attribution
