Question

On most C compilers plain char is signed. Most C libraries define EOF as -1.

Despite being a long-time C programmer I had never before put these two facts together and so in the interest of robust and international software I would ask for a bit of help in spelling out the implications.

Here is what I have discovered thus far:

  • fgetc() and friends convert the character they read to unsigned char before returning it as an int, so that no valid character clashes with EOF.
  • Therefore care needs to be taken when comparing the result with char constants, e.g. getchar() == (unsigned char) 'µ' (see the sketch after this list).
  • Theoretically I believe that not even the basic character set is guaranteed to be positive.
  • The <ctype.h> functions are designed to handle EOF and otherwise expect values representable as an unsigned char. Any other negative input may cause out-of-bounds addressing.
  • Most functions taking character parameters as integers ignore EOF and will accept signed or unsigned characters interchangeably.
  • String comparison (strcmp/strncmp/memcmp) compares the strings as arrays of unsigned char.
  • It may be impossible to distinguish EOF from a valid character on systems where sizeof(int) == 1.
  • The wide character functions are not used for binary I/O and so WEOF is defined within the range of wchar_t.
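
Assuming a platform where plain char is signed and EOF is -1 (common but not guaranteed), a minimal sketch of the pitfall and of the usual read idiom might look like this; the file name is only a placeholder:

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("data.bin", "rb");   /* placeholder file name */
        if (f == NULL)
            return 1;

        /* Buggy pattern: storing the result in a char loses the EOF
         * distinction. With a signed char the byte 0xFF becomes -1, which
         * compares equal to EOF and ends the loop before end of file. */
        char bad;
        while ((bad = fgetc(f)) != EOF)
            ;
        (void) bad;

        rewind(f);

        /* Usual idiom: keep the int returned by fgetc(), test it against
         * EOF, and only then narrow it to unsigned char. Comparisons with
         * char constants need the cast too, because fgetc() returns the
         * value of an unsigned char: c == (unsigned char) 'µ', not c == 'µ'. */
        int c;
        while ((c = fgetc(f)) != EOF) {
            unsigned char byte = (unsigned char) c;
            (void) byte;   /* process the byte here */
        }

        fclose(f);
        return 0;
    }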

Is this assessment correct and if so what other gotchas did I miss?

Full disclosure: I ran into an out-of-bounds indexing bug today when feeding non-ASCII characters to isspace(), and the realization of how many bugs may be lurking in my old code both scared and annoyed me. Hence this frustrated question.
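
For what it's worth, the shape of that bug and of the usual fix is roughly the following (the function names are made up for the example, and the failure mode assumes plain char is signed):

    #include <ctype.h>

    /* Buggy: if *s holds a non-ASCII byte and plain char is signed, the
     * argument is a negative value other than EOF; that is undefined
     * behaviour and may index outside the classification table. */
    int count_spaces_bad(const char *s)
    {
        int n = 0;
        for (; *s != '\0'; s++)
            if (isspace(*s))        /* possible out-of-bounds lookup */
                n++;
        return n;
    }

    /* Fixed: convert to unsigned char first, so the argument is always in
     * the range the <ctype.h> functions are specified for. */
    int count_spaces_ok(const char *s)
    {
        int n = 0;
        for (; *s != '\0'; s++)
            if (isspace((unsigned char) *s))
                n++;
        return n;
    }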

Solution

The basic execution character set is guaranteed to be nonnegative - the precise wording in C99 is:

If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.
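
As a quick illustration of what that guarantee does and does not cover (sketch only; the value printed for the non-basic character depends on whether plain char is signed on your platform):

    #include <stdio.h>

    int main(void)
    {
        char basic = 'A';        /* basic execution character set: guaranteed
                                    nonnegative when stored in a char */
        char extended = '\xB5';  /* 0xB5 (µ in Latin-1): outside the basic set,
                                    may come out negative if char is signed */

        printf("'A'    stored in char: %d\n", basic);
        printf("'\\xB5' stored in char: %d\n", extended);
        return 0;
    }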
