Question

On most C compilers plain char is signed. Most C libraries define EOF as -1.

Despite being a long-time C programmer I had never before put these two facts together and so in the interest of robust and international software I would ask for a bit of help in spelling out the implications.

Here is what I have discovered thus far:

  • fgetc() and friends convert the character they read to unsigned char before returning it as an int, so that no valid character clashes with EOF.
  • Therefore care needs to be taken when comparing the result with char constants, e.g. getchar() == (unsigned char) 'µ' (see the sketch after this list).
  • Theoretically I believe that not even the basic character set is guaranteed to be positive.
  • The <ctype.h> functions are designed to handle EOF and otherwise expect values representable as an unsigned char. Any other negative input may cause out-of-bounds addressing.
  • Most functions taking character parameters as integers ignore EOF and will accept signed or unsigned characters interchangeably.
  • String comparison (strcmp/strncmp/memcmp) compares the strings as arrays of unsigned char.
  • It may be impossible to distinguish EOF from a valid character on systems where sizeof(int) == 1.
  • The wide character functions are not used for binary I/O and so WEOF is defined within the range of wchar_t.
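
Assuming a platform where plain char is signed and EOF is -1 (common but not guaranteed), a minimal sketch of the pitfall and of the usual read idiom might look like this; the file name is only a placeholder:

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("data.bin", "rb");   /* placeholder file name */
        if (f == NULL)
            return 1;

        /* Buggy pattern: storing the result in a char loses the EOF
         * distinction. With a signed char the byte 0xFF becomes -1, which
         * compares equal to EOF and ends the loop before end of file. */
        char bad;
        while ((bad = fgetc(f)) != EOF)
            ;
        (void) bad;

        rewind(f);

        /* Usual idiom: keep the int returned by fgetc(), test it against
         * EOF, and only then narrow it to unsigned char. Comparisons with
         * char constants need the cast too, because fgetc() returns the
         * value of an unsigned char: c == (unsigned char) 'µ', not c == 'µ'. */
        int c;
        while ((c = fgetc(f)) != EOF) {
            unsigned char byte = (unsigned char) c;
            (void) byte;   /* process the byte here */
        }

        fclose(f);
        return 0;
    }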

Is this assessment correct and if so what other gotchas did I miss?

Full disclosure: I ran into an out-of-bounds indexing bug today when feeding non-ASCII characters to isspace(), and the realization of how many bugs may be lurking in my old code both scared and annoyed me. Hence this frustrated question.
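
For what it's worth, the shape of that bug and of the usual fix is roughly the following (the function names are made up for the example, and the failure mode assumes plain char is signed):

    #include <ctype.h>

    /* Buggy: if *s holds a non-ASCII byte and plain char is signed, the
     * argument is a negative value other than EOF; that is undefined
     * behaviour and may index outside the classification table. */
    int count_spaces_bad(const char *s)
    {
        int n = 0;
        for (; *s != '\0'; s++)
            if (isspace(*s))        /* possible out-of-bounds lookup */
                n++;
        return n;
    }

    /* Fixed: convert to unsigned char first, so the argument is always in
     * the range the <ctype.h> functions are specified for. */
    int count_spaces_ok(const char *s)
    {
        int n = 0;
        for (; *s != '\0'; s++)
            if (isspace((unsigned char) *s))
                n++;
        return n;
    }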

Solution

The basic execution character set is guaranteed to be nonnegative - the precise wording in C99 is:

If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.
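
As a quick illustration of what that guarantee does and does not cover (sketch only; the value printed for the non-basic character depends on whether plain char is signed on your platform):

    #include <stdio.h>

    int main(void)
    {
        char basic = 'A';        /* basic execution character set: guaranteed
                                    nonnegative when stored in a char */
        char extended = '\xB5';  /* 0xB5 (µ in Latin-1): outside the basic set,
                                    may come out negative if char is signed */

        printf("'A'    stored in char: %d\n", basic);
        printf("'\\xB5' stored in char: %d\n", extended);
        return 0;
    }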
