Reading characters outside ASCII
Question
A friend of mine showed me a situation where reading characters produced unexpected behaviour. Reading the character '¤' caused his program to crash. I was able to conclude that '¤' is 164 decimal so it's over the ASCII range.
We noticed the behaviour on '¤' but any character >127 seems to show the problem. The question is how would we reliably read such characters char by char?
#include <iomanip>
#include <iostream>
using namespace std;

int main(int argc, const char *argv[])
{
    char input;
    do
    {
        cin >> input;
        cout << input;
        cout << " " << setbase(10) << (int)input;
        cout << " 0x" << setbase(16) << (int)input;
        cout << endl;
    } while(input);
    return 0;
}
masse@libre:temp/2009-11-30 $ ./a.out
¤
 -62 0xffffffc2
¤ -92 0xffffffa4
Solution
Your system is using UTF-8 character encoding (as it should), so the character '¤' causes your program to read the sequence of bytes C2 A4. Since a char is one byte, it reads them one at a time. Look into wchar_t and the corresponding wcin and wcout streams to read multibyte characters, although I don't know which encodings they support or how they play with locales.
Also, your program is outputting invalid UTF-8, because it prints each byte on its own with other text in between, so you really shouldn't be seeing those two characters; I get question marks on my system.
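If wide streams turn out not to fit, one alternative is to keep reading plain bytes and group them into whole characters by looking at the UTF-8 lead byte. The following is only a rough sketch of that idea: the names utf8SequenceLength and splitUtf8 are made up for illustration, and the code does not check that continuation bytes are well formed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Invalid lead bytes are reported as length 1 so that malformed
// input is passed through one byte at a time.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // stray continuation or invalid byte
}

// Splits a UTF-8 byte string into one std::string per encoded character.
std::vector<std::string> splitUtf8(const std::string& bytes)
{
    std::vector<std::string> characters;
    std::size_t i = 0;
    while (i < bytes.size()) {
        std::size_t length =
            utf8SequenceLength(static_cast<unsigned char>(bytes[i]));
        if (i + length > bytes.size())
            length = bytes.size() - i;    // truncated sequence at end of input
        characters.push_back(bytes.substr(i, length));
        i += length;
    }
    return characters;
}
```

Writing each returned element to cout in one piece keeps the bytes C2 A4 together, so a UTF-8 terminal can display '¤' instead of two broken halves.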
(This is a nitpick and somewhat off-topic, but your while(input) should be while(cin), otherwise you'll get an infinite loop.)
OTHER TIPS
It is hard to tell why your friend's program is crashing without seeing the code, but it could be because the char is being used as an index into an array. Characters outside the regular ASCII range do not fit in a signed char (which is what plain char is on many platforms), so the value ends up negative.
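To illustrate the kind of code where this bites, here is a small sketch of a byte-frequency table. It is an assumption about what the crashing program might look like, not the friend's actual code, and countBytes is a made-up name.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Counts how often each byte value occurs in `text`.
std::array<std::size_t, 256> countBytes(const std::string& text)
{
    std::array<std::size_t, 256> counts{};  // all entries start at zero
    for (char c : text) {
        // Where char is signed, bytes above 127 are negative, and
        // counts[c] would index before the start of the array.
        counts[static_cast<unsigned char>(c)]++;
    }
    return counts;
}
```

Without the cast, a byte such as A4 would be used as index -92 on a signed-char platform, which is undefined behaviour and a plausible cause of a crash.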
Declare 'input' as unsigned char instead, so that bytes above 127 come out as positive values (194 and 164 here).