Number of bytes of CString in C++

https://stackoverflow.com/questions/23177430

06-07-2023

Question

I have a Unicode string stored in a CString and I need to know the number of bytes this string takes in UTF-8 encoding. I know CString has a method GetLength(), but that returns the number of characters, not bytes.

I tried (among other things) converting to a char array, but I get (logically, I guess) only an array of wchar_t, so this doesn't solve my problem.

To be clear about my goal: for the input, let's say, "aaa" I want "3" as output (since "a" takes one byte in UTF-8). But for the input "āaa", I'd like to see output "4" (since ā is a two-byte character).

I think this has to be quite a common request, but even after 1.5 hours of searching and experimenting, I couldn't find the correct solution.

I have very little experience with Windows programming, so maybe I left out some crucial information. If you feel like that, please let me know, I'll add any information you request.

Solution

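As your CString contains a series of wchar_t (UTF-16 code units), you can just use WideCharToMultiByte with CP_UTF8 as the output code page. The function returns the number of bytes written to the output buffer, i.e. the length of the UTF-8 encoded string. Note that when cchWideChar is -1 the input is treated as null-terminated and the returned count includes the terminating null byte; pass the character count (for example from GetLength()) instead if you want the length without it.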

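LPCWSTR instr = L"\u0101aa";     // "āaa"; a wide CString converts to LPCWSTR implicitly
char outstr[MAX_OUTSTR_SIZE];    // MAX_OUTSTR_SIZE is a buffer size you choose
// -1 means instr is null-terminated; the result then counts the terminating null too (5 here, not 4).
int utf8_len = WideCharToMultiByte(CP_UTF8, 0, instr, -1, outstr, MAX_OUTSTR_SIZE, NULL, NULL);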

If you don't need the output string, you can simply set the output buffer size to 0. The documentation describes the parameter as follows:

cbMultiByte

Size, in bytes, of the buffer indicated by lpMultiByteStr. If this parameter is set to 0, the function returns the required buffer size for lpMultiByteStr and makes no use of the output parameter itself.

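In that case the function returns the number of bytes the UTF-8 string needs, without actually writing any output:

// instr is null-terminated (-1), so the result includes the terminating null; subtract 1 for the text alone.
int utf8_len = WideCharToMultiByte(CP_UTF8, 0, instr, -1, NULL, 0, NULL, NULL);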
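To sanity-check the number the API reports, the same count can be worked out by hand from the UTF-16 code units. The following is a minimal portable sketch, not the Windows API and not part of CString: utf8_byte_count is a hypothetical helper name, and std::u16string stands in for the wchar_t buffer a wide CString holds on Windows. Counting an unpaired surrogate as 3 bytes (the size of U+FFFD) is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes a UTF-16 string needs in UTF-8,
// without a terminating null. Unpaired surrogates are counted as 3 bytes,
// the size of the U+FFFD replacement character (an assumption of this sketch).
std::size_t utf8_byte_count(const std::u16string& s) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c < 0x80) {
            bytes += 1;  // ASCII range
        } else if (c < 0x800) {
            bytes += 2;  // e.g. U+0101 (ā)
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size() &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;  // surrogate pair: one code point outside the BMP
            ++i;         // skip the low surrogate that was just consumed
        } else {
            bytes += 3;  // rest of the BMP, and lone surrogates
        }
    }
    return bytes;
}
```

With this helper, utf8_byte_count(u"aaa") gives 3 and utf8_byte_count(u"\u0101aa") gives 4, the values asked for in the question.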

If your CString is really a CStringA, i.e. _UNICODE is not defined, then you need to use MultiByteToWideChar to convert the string to UTF-16 first, and then convert from UTF-16 to UTF-8 with WideCharToMultiByte. See "How do I convert an ANSI string directly to UTF-8?" That said, new code should never be compiled without Unicode support anyway.

Licensed under: CC-BY-SA with attribution