What is the common procedure used when compilers statically type check “complex” expressions?

https://softwareengineering.stackexchange.com/questions/352112

15-01-2021
|

Question

Note: When I used "complex" in the title, I mean that the expression has many operators and operands. Not that the expression itself is complex.

I've recently been working on a simple compiler to x86-64 assembly. I've finished the compiler's main front end - the lexer and parser - and am now able to generate an Abstract Syntax Tree representation of my program. And since my language will be statically typed, I am now doing the next phase: type checking the source code. However, I've come to a problem and have not been able to reasonably solve it myself.

Consider the following example:

My compiler's parser has read this line of code:

int a = 1 + 2 - 3 * 4 - 5

And converted it to the following AST:

       =
     /   \
  a(int)  \
           -
         /   \
        -     5
      /   \
     +     *
    / \   / \
   1   2 3   4

Now it must type check the AST. it starts by first type checking the = operator. It first checks the left hand side of the operator. It sees that the variable a is declared as an integer. So it must now verify that the right hand side expression evaluates to an integer.

I understand how this could be done if the expression was just a single value, such as 1 or 'a'. But how would this be done for expressions with multiple values and operands - a complex expression - such as the one above? To correctly determine the value of the expression, it seems like the type checker would actual have to execute the expression itself and record the result. But this obviously seems to defeat the purpose of separating the compilation and execution phases.

The only other way I imagine this could be done is to recursively check the leaf of each subexpression in the AST and verify all of the leaf's types match the expected operator type. So starting with the = operator, the type checker would then scan all of the left hand side's AST and verify that the leafs are all integers. It would then repeat this for each operator in the subexpression.

I've tried researching the topic in my copy of "The Dragon Book", but it doesn't seem go in to much detail, and simply reiterates what I already know.

What is the usual method used when a compiler is type checking expressions with many operators and operands? Are any of the methods I mentioned above used? If not, what are the methods and how exactly would they work?

Solution

Recursion is the answer, but you descend into each subtree before handling the operation:

int a = 1 + 2 - 3 * 4 - 5

to tree form:

(assign (a) (sub (sub (add (1) (2)) (mul (3) (4))) (5))

Inferring the type happens by first walking the left hand side, then the right hand side, and then handling the operator as soon as the operands' types are inferred:

(assign*(a) (sub (sub (add (1) (2)) (mul (3) (4))) (5))

-> descend into lhs

(assign (a*) (sub (sub (add (1) (2)) (mul (3) (4))) (5))

-> infer a. a is known to be int. We're back in the assign node now:

(assign (int:a)*(sub (sub (add (1) (2)) (mul (3) (4))) (5))

-> descend into rhs, then into the lhs of the inner operators until we hit something interesting

(assign (int:a) (sub*(sub (add (1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub*(add (1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub (add*(1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub (add (1*) (2)) (mul (3) (4))) (5))

-> infer the type of 1, which is int, and return to the parent

(assign (int:a) (sub (sub (add (int:1)*(2)) (mul (3) (4))) (5))

-> go into the rhs

(assign (int:a) (sub (sub (add (int:1) (2*)) (mul (3) (4))) (5))

-> infer the type of 2, which is int, and return to the parent

(assign (int:a) (sub (sub (add (int:1) (int:2)*) (mul (3) (4))) (5))

-> infer the type of add(int, int), which is int, and return to the parent

(assign (int:a) (sub (sub (int:add (int:1) (int:2))*(mul (3) (4))) (5))

-> descend into the rhs

(assign (int:a) (sub (sub (int:add (int:1) (int:2)) (mul*(3) (4))) (5))

etc., until you end up with

(assign (int:a) (int:sub (int:sub (int:add (int:1) (int:2)) (int:mul (int:3) (int:4))) (int:5))*

Whether the assignment itself is also an expression with a type depends on your language.

The important takeaway: to determine the type of any operator node in the tree, you only have to look at its immediate children, which need to have a type assigned to them already.

OTHER TIPS

What is the usually method used when a compiler is type checking expressions with many operators and operands.

Read wikipages on type system and type inference and on Hindley-Milner type system, which uses unification. Read also about denotational semantics and operational semantics.

Type checking can be simpler if:

all your variables like a are explicitly declared with a type. This is like C or Pascal or C++98, but not like C++11 which has some type inference with auto.
all literal values like 1, 2, or 'c' have an inherent type: an int literal always has type int, a character literal always has type char, ….
functions and operators are not overloaded, e.g. the + operator always has type (int, int) -> int. C has overloading for operators (+ works for signed and unsigned integer types and for doubles) but no overloading of functions.

Under these constraints, a bottom up recursive AST type decoration algorithm could be enough (this only cares about types, not about concrete values, so is a compile-time approach):

For each scope, you keep a table for the types of all visible variables (called the environment). After a declaration int a, you would add the entry a: int to the table.
Typing of leaves is the trivial recursion base case: the type of literals like 1 is already known, and the type of variables like a can be looked up in the environment.
To type a expression with some operator and operands according to the previously computed types of the (nested sub-expressions) operands, we use recursion on the operands (so we type first these sub-expressions) and follow the typing rules related to the operator.

So in your example, 4 * 3 and 1 + 2 are typed int because 4 & 3 and 1 & 2 have been previously typed int and your typing rules say that the sum or product of two int-s is an int, and so on for (4 * 3) - (1 + 2).

Then Read Pierce's Types and Programming Languages book. I recommend to learn a tiny bit of Ocaml and λ-calculus

For more dynamically typed languages (Lisp like) read also Queinnec's Lisp In Small Pieces

Read also Scott's Programming Languages Pragmatics book

BTW, you can't have a language agnostic typing code, because the type system is an essential part of the language's semantics.

In C (and frankly most statically typed languages based on C) every operator can be seen as syntactic sugar for a function call.

So your expression can be rewritten as:

int a{operator-(operator-(operator+(1,2),operator*(3,4)),5)};

Then overload resolution will kick in and decide that every function is of the (int, int) or (const int&, const int&) type.

This way makes type resolution easy to understand and follow and (more importantly) easy to implement. Information about types only flows in 1 way (from the inner expressions outwards).

That is the reason why double x = 1/2; will result in x == 0 because 1/2 is evaluated as an int expression.

Focusing on your algorithm, try changing it to bottom-up. You know the type pf variables and constants; tag the node bearing the operator with the result type. Let the leaf determine the operator’s type, also the opposite of your idea.

It's actually quite easy, as long as you think of + as being a variety of functions rather than a single concept.

    int operator=(int)
     /   \
  a(int)  \
        int operator-(int,int)
         /                  \
    int operator-(int,int)    5
         /              \
int operator+(int,int) int operator*(int,int)
    / \                      / \
   1   2                    3   4

During the parsing stage of the right hand side, the parser retrieves 1, knows that's an int, then parses +,and stores that as an "unresolved function name", then it parses the 2, knows that's an int, and then returns that up the stack. The + function node now knows both parameter types, so can resolve the + into int operator+(int, int), so now it knows the type of this sub-expression, and the parser continues on it's merry way.

As you can see, once the tree is fully constructed, each node, including the function calls, knows its types. This is key because it allows for functions that return different types than their parameters.

char* ptr = itoa(3);

Here, the tree is:

    char* itoa(int)
     /           \
  ptr(char*)      3

The basis for type checking is not what the compiler does, it is what the language defines.

In the C language, every operand has a type. "abc" has type "array of const char". 1 has type "int". 1L has type "long". If x and y are expressions, then there are rules for the type of x + y and so on. So the compiler obviously has to follow the rules of the language.

On modern languages like Swift, the rules are much more complicated. Some cases are simple like in C. Other cases, the compiler sees an expression, has been told beforehand what type the expression should have, and determines the types of subexpressions based on that. If x and y are variables of different types, and an identical expression is assigned, that expression might be evaluated in a different way. For example assigning 12 * (2 / 3) will assign 8.0 to a Double and 0 to an Int. And you have cases where the compiler knows that two types are related and figures out what types they are based on that.

Swift example:

var x: Double
var y: Int

x = 12 * (2 / 3)
y = 12 * (2 / 3)

print (x, y)

prints "8.0, 0".

In the assignment x = 12 * (2 / 3): The left hand side has a known type Double, therefore the right hand side must have type Double. There is only one overload for the "*" operator returning Double, and that is Double * Double -> Double. Therefore 12 must have type Double, as well as 2 / 3. 12 supports the "IntegerLiteralConvertible" protocol. Double has an initialiser taking an argument of type "IntegerLiteralConvertible", so 12 is converted to Double. 2 / 3 must have type Double. There is only one overload for the "/" operator returning Double, and that is Double / Double -> Double. 2 and 3 are converted to Double. The result of 2 / 3 is 0.6666666. The result of 12 * (2 / 3) is 8.0. 8.0 is assigned to x.

In the assignment y = 12 * (2 / 3), y on the left hand side has type Int, so the right hand side must have type Int, so 12, 2, 3 are converted to Int with the result 2 / 3 = 0, 12 * (2 / 3) = 0.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange