Question

In UTF-8, I count characters (not bytes) using this function:

int schars(const char *s)
{
    int i = 0;

    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

Does this work on implementations where plain char is unsigned char?


Solution 2

It should.

You are only using bitwise operators, and those behave the same whether the underlying data type is signed or unsigned. The only exception may be the != operator, but you could replace this with a ^ and test the result for nonzero, à la:

((*s & 0xc0) ^ 0x80)

and then you have solely bitwise operators.

You can verify that the characters are promoted to integers by checking section 3.3.10 of the ANSI C Standard, which states that "Each of the operands [of the bitwise AND] shall have integral type" and that the usual arithmetic conversions are performed on the operands.

EDIT

I amend my answer. Bitwise operations are not the same on signed as on unsigned, as per 3.3 of the ANSI C Standard:

Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) shall have operands that have integral type. These operators return values that depend on the internal representations of integers, and thus have implementation-defined aspects for signed types.

In fact, performing bitwise operations on signed integers is listed as a possible security hole here.

In the Visual Studio compiler signed and unsigned are treated the same (see here).

As this SO question discusses, it is better to use unsigned char to do byte-wise reads of memory and manipulations of memory.
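As a rough sketch of that advice, the counting loop can read the string through an unsigned char pointer, so every byte is in the range 0 to 255 before it is masked. The name schars_u is made up for illustration and is not from the question.

```c
/* Hypothetical variant of schars(): counts UTF-8 code points while
 * reading the bytes as unsigned char, so no byte is ever negative. */
int schars_u(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int count = 0;

    while (*p) {
        /* Continuation bytes have the form 10xxxxxx; skip them. */
        if ((*p & 0xC0u) != 0x80u)
            count++;
        p++;
    }
    return count;
}
```

With this version the result no longer depends on whether plain char is signed, because the mask is always applied to a non-negative value.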

Other suggestions

It works as well when char is unsigned as it does when it's signed.

In both a signed 2's complement representation and in an unsigned representation, the 8th and 7th bits (the two most significant) of a UTF-8 code unit are 10 if and only if the code unit is not the first code unit of a code point. So you're counting 1 for the first code unit of each code point.
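The bit test described above can be sketched in isolation like this. The helper name is_continuation is hypothetical; it takes the byte as unsigned char just to keep the example simple.

```c
/* Hypothetical helper: returns nonzero if the byte has the form
 * 10xxxxxx, i.e. it is a continuation byte and not the first code
 * unit of a code point. */
int is_continuation(unsigned char byte)
{
    /* 0xC0 keeps only the top two bits; 0x80 is the pattern 10. */
    return (byte & 0xC0) == 0x80;
}
```

For example, the euro sign U+20AC is encoded as the bytes E2 82 AC. Only E2 is a first code unit, so is_continuation returns 0 for it and nonzero for 82 and AC, and the code point is counted once.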

int is not guaranteed to be a large enough type to contain the number of characters in every string, but I assume you don't care ;-)

"Character" is potentially an ambiguous term. This code counts Unicode code points, which is not the same thing as displayable characters ("graphemes"). Sometimes multiple code points represent a single grapheme, for example when combining marks are used for accents. About the only practical use for knowing how many code points there are in a Unicode string is to calculate how many bytes it will occupy when encoded as UTF-32. If you're careful, you can ensure that the only code that needs to process "characters" is the font engine, plus some complex operations like Unicode normalization and character encodings.

Yes, it will.

*s will be promoted to int before the computations take place. So, your code is equivalent to:

if (((int) *s & 0xC0) != 0x80) i++;

And the above will work even if char is unsigned.
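Here is a small sketch of what the promotion does in each case. It assumes a two's complement representation for the signed case, and both function names are hypothetical.

```c
/* Applies the question's test to a byte held in a signed char,
 * mimicking an implementation where plain char is signed.
 * Assumes two's complement. */
int is_first_unit_signed(signed char c)
{
    /* A byte such as 0xA9 is stored as -87 and promotes to the int -87.
     * Its low 8 bits are unchanged, so the mask still sees 10xxxxxx. */
    return ((int)c & 0xC0) != 0x80;
}

/* The same test where the byte is held in an unsigned char. */
int is_first_unit_unsigned(unsigned char c)
{
    /* Here 0xA9 promotes to the int 169; the low 8 bits are identical. */
    return ((int)c & 0xC0) != 0x80;
}
```

The continuation byte 0xA9 gives 0 from both functions (passed as -87 in the signed case), and the lead byte 0xC3 gives 1 from both (passed as -61 in the signed case).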

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow