Character data is, in most modern machines, managed as 8-bit bytes. (In some cases characters are 16 or 32 bits wide, but that just adds confusion at this juncture.)
If you look at an ASCII table you will see the basic "Latin" character set (the printable characters run from the space at hex 20 through "~" at hex 7E).
The individual characters are identified by an 8-bit byte where (for the basic ASCII chars) the high-order bit is zero. So values run between 0 and 127, or between 00 and 7F hex (or between 00000000 and 01111111 binary).
I should inject here that the first 32 codes are non-printing "control character" codes. E.g., the code at decimal 10 (hex 0A) is the "line feed" code, which is the code known in C and Java as "newline". And the 00 code is the "NUL" character mentioned below.
The characters in a sentence are laid out in order in memory, in successive bytes. Hence, "Hello" will be 48 65 6C 6C 6F in hex. For C and C++ a simple "C string" always ends with a byte of all zeros (the "NUL" character in the chart). For Java the length of the string is kept in a separate variable somewhere else. A few string formats instead "prefix" the string with its length as an 8-bit or 16-bit value.
As you can see above, the ASCII character set includes non-alphabetic characters such as !, +, and ?. For "non-Latin" characters (e.g., £ or Ç) one of several techniques is used to "extend" the character set. Sometimes the 8-bit values 128 to 255 are used to represent the non-Latin characters of a given language (though one must know which language is intended in order to know which set of characters is being represented). In other cases "Unicode" is used, with 16-bit or 32-bit characters instead of 8-bit characters, so that virtually every character in every language has its own unique code.