Converting UTF-8 to UTF-32, pre-calculating the number of 'chars' in each

https://stackoverflow.com/questions/10744147

10-06-2021

Question

I have a working algorithm to convert a UTF-8 string to a UTF-32 string. However, I have to allocate all the space for my UTF-32 string ahead of time. Is there any way to know how many UTF-32 characters a UTF-8 string will take up?

For example, the UTF-8 string "¥0" is 3 chars, and once converted to UTF-32 is 2 unsigned ints. Is there any way to know the number of UTF-32 'chars' I will need before doing the conversion? Or am I going to have to re-write the algorithm?

Solution

There are two basic options:

1. Make two passes through the UTF-8 string: the first counts the number of UTF-32 characters you'll need to generate, and the second actually writes them to a buffer.
2. Allocate the maximum number of 32-bit chars you could possibly need, i.e., the length of the UTF-8 string in bytes (every code point takes at least one byte, so the byte count is an upper bound). This wastes memory, up to four times what is needed, but means you can transform UTF-8 to UTF-32 in one pass.
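As a rough illustration of the one-pass option, the sketch below allocates one 32-bit slot per input byte and decodes as it goes. The function name `utf8_to_utf32_onepass` and its signature are made up for this example, and it assumes valid, NUL-terminated UTF-8. Invalid input is not rejected, it only produces meaningless code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (not from the original answer): decode UTF-8 into a
// freshly allocated, 0-terminated UTF-32 buffer in a single pass.
// Returns NULL on allocation failure; *out_len receives the code point count.
static uint32_t *utf8_to_utf32_onepass(const char *src, size_t *out_len) {
    assert(src != NULL && out_len != NULL);

    // Worst case: every byte is its own code point (pure ASCII input).
    size_t cap = strlen(src);
    uint32_t *dst = malloc((cap + 1) * sizeof *dst);  // +1 for the terminator
    if (!dst) return NULL;

    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;
    while (*p) {
        uint32_t cp;
        int extra;  // continuation bytes expected after the lead byte
        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xe0) { cp = *p & 0x1f; extra = 1; }
        else if (*p < 0xf0) { cp = *p & 0x0f; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        ++p;
        // Each continuation byte (10xxxxxx) contributes six payload bits.
        // The check also stops at the terminating NUL on truncated input.
        while (extra-- > 0 && (*p & 0xc0) == 0x80) {
            cp = (cp << 6) | (uint32_t)(*p++ & 0x3f);
        }
        dst[n++] = cp;
    }
    dst[n] = 0;
    *out_len = n;
    return dst;
}
```

The caller owns the returned buffer and releases it with `free`. If the unused tail matters, it could be trimmed afterwards with `realloc`.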

You could also use a hybrid -- e.g., if the string is shorter than some threshold then use the second approach, otherwise use the first.
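A minimal sketch of how the hybrid might choose a buffer size follows. The threshold value, the macro `SMALL_STRING_THRESHOLD`, and the function name `utf32_capacity` are all assumptions for illustration. The right cutoff depends on the workload.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical cutoff in bytes; not a recommended value.
#define SMALL_STRING_THRESHOLD 256

// Hypothetical helper: number of UTF-32 slots to allocate for a
// NUL-terminated UTF-8 string (terminator not included).
static size_t utf32_capacity(const char *src) {
    assert(src != NULL);
    size_t bytes = strlen(src);

    // Short string: accept the over-allocation and skip the counting pass.
    if (bytes < SMALL_STRING_THRESHOLD) return bytes;

    // Long string: count exactly by skipping 10xxxxxx continuation bytes
    // (assumes valid UTF-8 input).
    size_t len = 0;
    for (size_t i = 0; i < bytes; ++i) {
        if (((unsigned char)src[i] & 0xc0) != 0x80) ++len;
    }
    return len;
}
```

For the three-byte string "¥0" this returns the byte count, 3, because the string is below the threshold. A long string is counted exactly.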

For the first approach, the first pass would look something like this:

size_t len = 0;  // warning: untested code.
for (const char *p = src; *p; ++p) {
    // bytes of the form 10xxxxxx are continuation bytes; every other byte
    // begins a new UTF-32 char (assuming valid UTF-8 input)
    if (((unsigned char)*p & 0xc0) != 0x80) ++len;
}
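For completeness, here is one way the counting pass and a second, writing pass could fit together. This is a sketch under the same assumption of valid, NUL-terminated UTF-8; the name `utf8_to_utf32_twopass` and its signature are hypothetical and not part of the original answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper: two-pass conversion into an exactly sized,
// 0-terminated UTF-32 buffer. Returns NULL on allocation failure.
static uint32_t *utf8_to_utf32_twopass(const char *src, size_t *out_len) {
    assert(src != NULL && out_len != NULL);
    const unsigned char *s = (const unsigned char *)src;

    // Pass 1: count lead bytes, i.e. bytes that are not 10xxxxxx.
    size_t len = 0;
    for (const unsigned char *p = s; *p; ++p) {
        if ((*p & 0xc0) != 0x80) ++len;
    }

    uint32_t *dst = malloc((len + 1) * sizeof *dst);  // +1 for the terminator
    if (!dst) return NULL;

    // Pass 2: start a code point at each lead byte, then append six bits
    // from every continuation byte that follows it.
    size_t n = 0;
    for (const unsigned char *p = s; *p; ++p) {
        if ((*p & 0xc0) == 0x80) {
            if (n > 0) dst[n - 1] = (dst[n - 1] << 6) | (uint32_t)(*p & 0x3f);
        } else if (*p < 0x80) {
            dst[n++] = *p;          // 0xxxxxxx: ASCII
        } else if (*p < 0xe0) {
            dst[n++] = *p & 0x1f;   // 110xxxxx: two-byte sequence
        } else if (*p < 0xf0) {
            dst[n++] = *p & 0x0f;   // 1110xxxx: three-byte sequence
        } else {
            dst[n++] = *p & 0x07;   // 11110xxx: four-byte sequence
        }
    }
    dst[n] = 0;
    *out_len = n;
    return dst;
}
```

Both passes use the same lead-byte test, so the second pass writes exactly as many elements as the first pass counted. For the "¥0" example from the question, the result has length 2 and holds U+00A5 followed by U+0030.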

Licensed under: CC-BY-SA with attribution
