Question

If all values are nothing more than one or more bytes, and no byte can contain metadata, how does the system keep track of what sort of number a byte represents? Looking into Two's Complement and Single Precision on Wikipedia reveals how these numbers can be represented in base two, but I'm still left wondering how the compiler or processor (not sure which I'm really dealing with here) determines that this byte must be a signed integer.

It is analogous to receiving an encrypted letter and, looking at my shelf of cyphers, wondering which one to grab. Some indicator is necessary.

If I think about what I might do to solve this problem, two solutions come to mind. Either I would claim an additional byte and use it to store a description, or I would allocate sections of memory specifically for numerical representations; a section for signed numbers, a section for floats, etc.

I'm dealing primarily with C on a Unix system but this may be a more general question.

Solution

how does the system keep track of what sort of number a byte represents?

"The system" doesn't. During translation, the compiler knows the types of the objects it's dealing with, and generates the appropriate machine instructions for dealing with those values.

Other tips

Ooh, good question. Let's start with the CPU - assuming an Intel x86 chip.

It turns out the CPU does not know whether a byte is "signed" or "unsigned." So when you add two numbers - or do any operation - flags in a "status register" are set.

Take a look at the "sign flag." When you add two numbers, the CPU does just that - adds the numbers and stores the result in a register. But the CPU also asks: "if instead we interpreted these numbers as two's complement signed integers, is the result negative?" If so, then that "sign flag" is set to 1.
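
A rough C-level sketch of that idea (signed overflow is undefined behavior in C, so the addition is done in unsigned arithmetic and merely reinterpreted afterwards; the conversion back to int is implementation-defined, but it gives the two's complement reading on common machines):

#include <stdio.h>

int main(void) {
    /* one and the same binary addition; executed on x86, this add
       would set the sign flag, because the result's top bit is 1 */
    unsigned int raw = 0x7FFFFFFFu + 1u;

    printf("as unsigned: %u\n", raw);       /* 2147483648 */
    printf("as signed:   %d\n", (int)raw);  /* typically -2147483648 */
    return 0;
}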

So if your program cares about signed versus unsigned and you were writing in assembly, you would check the status of that flag, and the rest of your program would perform a different task based on it.

So when you use signed int versus unsigned int in C, you are basically telling the compiler how (or whether) to use that sign flag.
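
You can watch that choice being made. Compiling the sketch below (e.g. with gcc -O1 -S) typically shows signed condition codes (setl/jl, which consult the sign and overflow flags) for the first function and unsigned ones (setb/jb, which consult the carry flag) for the second, even though the two bodies are textually identical; exact output depends on compiler and target:

/* identical bodies; only the types differ, so the compiler picks
   sign-flag-based or carry-flag-based comparisons accordingly */
int less_signed(int a, int b)             { return a < b; }
int less_unsigned(unsigned a, unsigned b) { return a < b; }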

The code that is executed has no information about the types. The only tool that knows the types is the compiler at the time it compiles the code. Types in C are solely a restriction at compile time to prevent you from using the wrong type somewhere. While compiling, the C compiler keeps track of the type of each variable and therefore knows which type belongs to which variable.

This is the reason why you need to use format strings in printf, for example. printf has no way of knowing what type it will get in the parameter list, as this information is lost. In languages like Go or Java you have a runtime with reflection capabilities, which makes it possible to get the type.
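
A small sketch of what that loss of information means in practice: printf trusts the format string blindly, because the bytes on the parameter list carry no type tag. (The commented-out line is undefined behavior, shown only to make the point.)

#include <stdio.h>

int main(void) {
    double d = 0.1;

    printf("%f\n", d);        /* correct: format matches the type */
    /* printf("%d\n", d); */  /* undefined behavior: printf would
                                 misread the raw bytes, and nothing
                                 at runtime can detect the mismatch */
    return 0;
}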

If your compiled C code still had type information in it, the resulting assembler language would need to check for types. It turns out that the only thing close to types in assembly is the size of the operands of an instruction, determined by suffixes (in GAS). So what is left of your type information is the size, and nothing more.
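
A sketch of "only the size is left": the two assignments below differ by type at the C level, but in the generated GAS output they typically differ only in the operand-size suffix of the store instruction (exact output depends on compiler and target):

char c;
int  i;

void assign(void) {
    c = 1;   /* typically movb $1, c - a 1-byte store */
    i = 1;   /* typically movl $1, i - a 4-byte store */
}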

One example of an assembly language that does support types is Java VM bytecode, which encodes the type of its primitive operands in the instructions themselves (e.g. iadd for int addition versus dadd for double addition).

It is important to remember that C and C++ are high-level languages. The compiler's job is to take the plain-text representation of the code and build it into the platform-specific instructions the target platform is expecting to execute. For most people using PCs this tends to be x86 assembly.

This is why C and C++ are so loose with how they define the basic data types. For example, most people say there are 8 bits in a byte. The standard does not fix an exact width: it only requires that a byte be the smallest addressable unit of data and at least 8 bits wide (CHAR_BIT >= 8), so there is nothing against some machine out there having, say, 9 bits per byte as its native interpretation of data.
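
Portable C therefore asks the implementation instead of assuming; a minimal probe:

#include <limits.h>
#include <stdio.h>

int main(void) {
    printf("bits per byte: %d\n", CHAR_BIT);      /* at least 8 */
    printf("bytes per int: %zu\n", sizeof(int));  /* platform-specific */
    return 0;
}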

So the interpretation of data is up to the instruction set of the processor. In many modern languages there is another abstraction on top of this, the Virtual Machine.

If you write your own scripting language it is up to you to define how you interpret your data in software.
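
That is essentially the questioner's "extra byte of description" idea. Here is a minimal sketch in C of such a tagged value, roughly as many dynamic-language interpreters implement it (all names here are made up for illustration):

#include <stdio.h>

enum tag { TAG_INT, TAG_DOUBLE };

struct value {
    enum tag tag;          /* the type metadata C itself never stores */
    union {
        int    i;
        double d;
    } as;
};

static void print_value(const struct value *v) {
    switch (v->tag) {      /* dispatch on the stored tag */
    case TAG_INT:    printf("int: %d\n",    v->as.i); break;
    case TAG_DOUBLE: printf("double: %f\n", v->as.d); break;
    }
}

int main(void) {
    struct value a = { TAG_INT,    { .i = 42  } };
    struct value b = { TAG_DOUBLE, { .d = 0.1 } };
    print_value(&a);
    print_value(&b);
    return 0;
}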

In C, besides the compiler - which knows perfectly well about the types of the given values - there is no system that knows about the type of a given value.

Note that C by itself doesn't bring any runtime type information system with it.

Take a look at the following example:

int i_var;      /* file-scope variable of type int    */
double d_var;   /* file-scope variable of type double */

int main () {

  i_var = -23;
  d_var = 0.1;

  return 0;
}

In the code there are two different types of values involved: one to be stored as an integer and one to be stored as a double value.

The compiler that analyzes the code knows the exact types of both of them. Here is a dump of a short fragment of the type information gcc held while generating code, obtained by passing the -fdump-tree-all switch to gcc:

@1      type_decl        name: @2       type: @3       srcp: <built-in>:0      
                         chan: @4      
@2      identifier_node  strg: int      lngt: 3       
@3      integer_type     name: @1       size: @5       algn: 32      
                         prec: 32       sign: signed   min : @6      
                         max : @7      
...
@5      integer_cst      type: @11      low : 32      
@6      integer_cst      type: @3       high: -1       low : -2147483648 
@7      integer_cst      type: @3       low : 2147483647 
...

@3805   var_decl         name: @3810    type: @3       srcp: main.c:3      
                         chan: @3811    size: @5       algn: 32      
                         used: 1       
...
@3810   identifier_node  strg: i_var    lngt: 5    

Hunting down the @links, you should clearly see that there really is a lot of information about memory size, alignment constraints and allowed min and max values for the type "int" stored in the nodes @1-@3 and @5-@7. (I left out the @4 node, as the mentioned "chan" entry is just used to chain up any type definitions in the generated tree.)

Regarding the variable declared at main.c line 3, it is known that it holds a value of type int, as seen by the type reference to node @3.

You will surely be able to hunt down the double entries and the ones for d_var in an experiment of your own, if you don't trust me that they will also be there.

Taking a look at the generated assembler code (obtained by passing the -S switch to gcc), we can see the way the compiler used this information in code generation:

    .file   "main.c"
    .comm   i_var,4,4
    .comm   d_var,8,8
    .text
.globl main
    .type   main, @function
main:
    pushl   %ebp
    movl    %esp, %ebp
    movl    $-23, i_var
    fldl    .LC0
    fstpl   d_var
    movl    $0, %eax
    popl    %ebp
    ret
    .size   main, .-main
    .section    .rodata
    .align 8
.LC0:
    .long   -1717986918
    .long   1069128089
    .ident  "GCC: (Debian 4.4.5-8) 4.4.5"
    .section    .note.GNU-stack,"",@progbits

Taking a look at the assignment instructions, you will see that the compiler figured out the right instructions: "movl" to assign our int value and "fstpl" to assign our "double" value.

Nevertheless, besides the instructions chosen, at the machine level there is no indication of the type of those values. Taking a look at the value stored at .LC0, the "double" value 0.1 was even broken down into two consecutive storage locations, one .long each, to meet the known "types" of the assembler.

As a matter of fact, breaking the value up this way was just one choice among other possibilities; using 8 consecutive values of "type" .byte would have done equally well.
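
You can reproduce those two numbers yourself; a small sketch that copies the raw bytes out of the double (this assumes 32-bit ints and a little-endian IEEE 754 platform, matching the x86 target above; the signed reinterpretation is implementation-defined but matches on typical machines):

#include <stdio.h>
#include <string.h>

int main(void) {
    double d = 0.1;
    unsigned int words[2];            /* assumes 32-bit unsigned int */

    memcpy(words, &d, sizeof d);      /* just bytes - no type survives */

    printf(".long %d\n", (int)words[0]);  /* -1717986918 */
    printf(".long %d\n", (int)words[1]);  /* 1069128089  */
    return 0;
}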

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow