Question

I'm in the process of creating a program that will scrub extended ASCII characters from text documents. I'm trying to understand how C# is interpreting the different character sets and codes, and am noticing some oddities.

Consider:

using System;
using System.Text;

namespace ASCIITest
{
    class Program
    {
        static void Main(string[] args)
        {
            string value = "Slide™1½”C4®";
            byte[] asciiValue = Encoding.ASCII.GetBytes(value);   // byte array
            char[] array = value.ToCharArray();                   // char array
            Console.WriteLine("CHAR\tBYTE\tINT32");
            for (int i = 0; i < array.Length; i++)
            {
                char  letter     = array[i];
                byte  byteValue  = asciiValue[i];
                Int32 int32Value = array[i];

                Console.WriteLine("{0}\t{1}\t{2}", letter, byteValue, int32Value);
            }
            Console.ReadLine();
        }
    }
}

Output from program

CHAR    BYTE    INT32
S       83      83
l       108     108
i       105     105
d       100     100
e       101     101
T       63      8482      <- trademark symbol
1       49      49
½       63      189       <- fraction
"       63      8221      <- smartquotes
C       67      67
4       52      52
r       63      174       <- registered trademark symbol

In particular, I'm trying to understand why the extended ASCII characters (the ones with my notes added to the right of the third column) show the correct value when widened to Int32, but all come back as 63 in the ASCII byte array. What's going on here?

Solution

Encoding.ASCII.GetBytes replaces every character outside the ASCII range (0-127) with a question mark (code 63).
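
You can see the substitution directly by decoding the byte array back into a string: every character the encoder cannot represent comes back as a literal ?. A minimal sketch (not in the original answer) using the string from the question:

using System;
using System.Text;

string text = "Slide™1½”C4®";
byte[] bytes = Encoding.ASCII.GetBytes(text);
Console.WriteLine(Encoding.ASCII.GetString(bytes));   // prints Slide?1??C4?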

So since your string contains characters outside of that range, your asciiValue array holds ? in place of all the interesting symbols like ™; its char (Unicode) representation is 8482, which is indeed outside the 0-127 range.

Converting the string to a char array does not modify the character values, so you still have the original Unicode codes (a char is essentially a 16-bit integer, UInt16); casting one to a wider integer type such as Int32 does not change the value.

Below are the possible conversions of that character into byte/integer values:

var value = "™";
var ascii = Encoding.ASCII.GetBytes(value)[0]; // 63 ('?') - outside the 0-127 range
var castToByte = (byte)value[0];               // 34 = 8482 % 256
var int16Value = (Int16)value[0];              // 8482
var int32Value = (Int32)value[0];              // 8482
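
Note that the unchecked (byte) cast in castToByte simply keeps the low-order byte of the 16-bit char: 8482 is 0x2122, and its low byte 0x22 is 34, the ASCII code for a double quote. That is a different kind of data loss than the encoder's ? substitution.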

Details are available in the ASCIIEncoding Class documentation:

ASCIIEncoding corresponds to the Windows code page 20127. Because ASCII is a 7-bit encoding, ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. If you use the default encoder returned by the Encoding.ASCII property or the ASCIIEncoding constructor, characters outside that range are replaced with a question mark (?) before the encoding operation is performed.
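
Since the stated goal is to scrub the extended characters rather than keep ? placeholders, one option (a sketch, not part of the original answer) is to build an ASCII encoding with an empty replacement fallback, so out-of-range characters are dropped instead of being replaced:

using System;
using System.Text;

// ASCII encoding whose fallback replaces unencodable characters with an
// empty string, i.e. silently drops them ("us-ascii" is code page 20127).
Encoding scrubber = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback(string.Empty),
    new DecoderReplacementFallback(string.Empty));

string input = "Slide™1½”C4®";
string scrubbed = scrubber.GetString(scrubber.GetBytes(input));
Console.WriteLine(scrubbed);   // prints Slide1C4

A LINQ filter such as new string(value.Where(c => c <= 127).ToArray()) would achieve the same result for this string.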

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow