Question

A friend of mine showed me a situation where reading characters produced unexpected behaviour: reading the character '¤' caused his program to crash. I was able to determine that '¤' is 164 decimal (U+00A4), so it's outside the 0-127 ASCII range.

We noticed the behaviour on '¤', but any character above 127 seems to show the problem. The question is: how would we reliably read such characters char by char?

#include <iostream>
#include <iomanip>
using namespace std;

int main(int argc, const char *argv[])
{
    char input;
    do
    {
        cin >> input;
        cout << input;
        cout << " " << setbase(10) << (int)input;
        cout << " 0x" << setbase(16) << (int)input;

        cout << endl;
    } while(input);
    return 0;
}


masse@libre:temp/2009-11-30 $ ./a.out 
¤
 -62 0xffffffc2
¤ -92 0xffffffa4

Solution

Your system is using the UTF-8 character encoding (as it should), so the character '¤' arrives as the two-byte sequence C2 A4. Since a char is one byte, your program reads those bytes one at a time. Look into wchar_t and the corresponding wcin and wcout streams to read multibyte characters, although I don't know exactly which encodings they support or how they interact with locales.
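
Something like the following minimal sketch might work, assuming the environment's locale is a UTF-8 one (both the empty locale name "" and how wcin decodes multibyte input are platform-dependent):

#include <iostream>
#include <locale>

int main()
{
    // Use the environment's locale (e.g. en_US.UTF-8) so the wide
    // streams convert between UTF-8 bytes and wide characters.
    std::locale::global(std::locale(""));
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    wchar_t input;
    while (std::wcin >> input)
    {
        // input now holds a whole code point, e.g. 164 (0xA4) for '¤'
        std::wcout << input << L" " << static_cast<long>(input) << L'\n';
    }
    return 0;
}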

Also, by echoing each byte separately with other text in between, your program splits the multibyte sequence and outputs invalid UTF-8, so you really shouldn't be seeing those two characters; I get question marks on my system.
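
If you'd rather stay with plain chars, you can read byte by byte and group the bytes of each UTF-8 sequence yourself. Here is a minimal sketch of that idea; it trusts the input to be well-formed UTF-8 and doesn't validate the continuation bytes:

#include <iostream>
#include <string>

// Length of a UTF-8 sequence, derived from the lead byte's bit pattern.
static int utf8_length(unsigned char lead)
{
    if (lead < 0x80)         return 1; // 0xxxxxxx: plain ASCII
    if ((lead >> 5) == 0x6)  return 2; // 110xxxxx: two-byte sequence
    if ((lead >> 4) == 0xE)  return 3; // 1110xxxx: three-byte sequence
    if ((lead >> 3) == 0x1E) return 4; // 11110xxx: four-byte sequence
    return 1;                          // invalid lead byte; pass through
}

int main()
{
    char byte;
    while (std::cin.get(byte))
    {
        std::string sequence(1, byte);
        for (int i = utf8_length(static_cast<unsigned char>(byte)); i > 1; --i)
        {
            if (!std::cin.get(byte)) break;
            sequence += byte; // collect the continuation bytes
        }

        // Printing the whole sequence at once keeps the output valid UTF-8.
        std::cout << sequence << " ";
        for (unsigned char c : sequence)
            std::cout << std::hex << "0x" << static_cast<int>(c) << " ";
        std::cout << std::dec << "\n";
    }
    return 0;
}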

(This is a nitpick and somewhat off-topic, but your while(input) should be while(cin); otherwise you'll get an infinite loop at end of input, since a failed read leaves input unchanged.)

OTHER TIPS

It is hard to tell why your friend's program is crashing without seeing the code, but it could be because the char is being used as an index into an array. On platforms where plain char is signed, byte values above 127 become negative, so the index ends up negative and the access lands outside the array.
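
We can't see the crashing code, but here is a hypothetical sketch of that pattern and its fix; the counts table is made up purely for illustration:

#include <iostream>

int main()
{
    int counts[256] = {0}; // hypothetical lookup table indexed by byte value
    char byte;
    while (std::cin.get(byte))
    {
        // counts[byte] would be counts[-62] for the byte 0xC2 when char
        // is signed: an out-of-bounds access that can crash. Casting to
        // unsigned char keeps the index in 0..255.
        ++counts[static_cast<unsigned char>(byte)];
    }

    int high = 0;
    for (int i = 128; i < 256; ++i)
        high += counts[i]; // bytes outside the ASCII range
    std::cout << "non-ASCII bytes: " << high << "\n";
    return 0;
}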

Declare 'input' as unsigned char instead.
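
Applied to the program above, that would look like the sketch below (operator>> has an overload for unsigned char, so little else changes; the loop is also switched to while(cin) as suggested earlier):

#include <iostream>
#include <iomanip>
using namespace std;

int main()
{
    unsigned char input; // never negative, so no sign extension in the cast
    while (cin >> input)
    {
        cout << input
             << " " << setbase(10) << (int)input   // e.g. 194 instead of -62
             << " 0x" << setbase(16) << (int)input // e.g. 0xc2 instead of 0xffffffc2
             << endl;
    }
    return 0;
}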

Licensed under: CC-BY-SA with attribution