Question

In C11, a new string literal prefix, u8, has been added. It yields an array of char containing the text encoded as UTF-8. How is this even possible? Isn't a normal char signed, meaning it has one bit less of information to use because of the sign bit? My logic suggests that a string of UTF-8 text would need to be an array of unsigned char.


Solution

Isn't a normal char signed?

It's implementation-defined whether char is signed or unsigned.

Further, the sign bit isn't "lost": it can still be used to represent information. And char is not necessarily 8 bits wide (it may be larger on some platforms).

OTHER TIPS

There is a potential problem here:

If an implementation with CHAR_BIT == 8 uses sign-magnitude representation for char (so char is signed), then when UTF-8 requires the bit-pattern 10000000, that's a negative 0. So if the implementation further does not support negative 0, then a given UTF-8 string might contain an invalid (trap) value of char, which is problematic. Even if it does support negative zero, the fact that bit pattern 10000000 compares equal as a char to bit pattern 00000000 (the nul terminator) is liable to cause problems when using UTF-8 data in a char[].

I think this means that for sign-magnitude C11 implementations, char needs to be unsigned. Normally it's up to the implementation whether char is signed or unsigned, but of course if char being signed results in failing to implement UTF-8 literals correctly then the implementer just has to pick unsigned. As an aside, this has been the case for non-2's complement implementations of C++ all along, since C++ allows char as well as unsigned char to be used to access object representations. C only allows unsigned char.

In 2's complement and 1s' complement, the bit patterns required for UTF-8 data are valid values of signed char, so the implementation is free to make char either signed or unsigned and still be able to represent UTF-8 strings in char[]. That's because all 256 bit patterns are valid 2's complement values, and UTF-8 happens not to use the byte 11111111 (1s' complement negative zero).

No, a sign bit is a bit nonetheless! And the UTF-8 specification itself doesn't say that the chars must be unsigned.

PS What kind of name is "kookwekker"?

The signedness of char does not matter: UTF-8 can be handled with only shift and mask operations (which may be cumbersome for signed types, but not impossible). But UTF-8 does need at least 8 bits per byte, so "assert(CHAR_BIT >= 8);".

To illustrate the point: the following fragment contains no arithmetic operations on the character's value, only shifts and masks.

static int eat_utf8(unsigned char *str, unsigned len, unsigned *target)
{
    unsigned val = 0;
    unsigned todo;

    if (!len) return 0;

    /* Classify the lead byte and keep only its payload bits. */
    val = str[0];
    if      ((val & 0x80) == 0x00) { if (target) *target = val; return 1; }
    else if ((val & 0xe0) == 0xc0) { val &= 0x1f; todo = 1; }
    else if ((val & 0xf0) == 0xe0) { val &= 0x0f; todo = 2; }
    else if ((val & 0xf8) == 0xf0) { val &= 0x07; todo = 3; }
    else if ((val & 0xfc) == 0xf8) { val &= 0x03; todo = 4; }
    else if ((val & 0xfe) == 0xfc) { val &= 0x01; todo = 5; }
    else {  /* Invalid lead byte (not in the spec) */
        if (target) *target = val;
        return -1;
    }

    len--; str++;
    /* Not enough input: report how many continuation bytes are needed. */
    if (todo > len) return -(int)todo;

    for (len = todo; todo--; ) {
        /* For validity checking we should also
        ** test if ((*str & 0xc0) == 0x80) here */
        val <<= 6;
        val |= *str++ & 0x3f;
    }

    if (target) *target = val;
    return 1 + len; /* total bytes consumed */
}
Licensed under: CC-BY-SA with attribution