Question

I've got the Dragon Book, but it doesn't seem to cover this topic...

In most modern languages it's possible to use certain variables even if their declarations and uses appear out of order in the code.

Example

class Foo {
    void bar() {
        plonk = 42;
    }
    int plonk;
}

It doesn't matter that the variable plonk is declared after the function.

Question
Is there any best practice or useful pattern for implementing this? Two approaches came to my mind:

  1. While parsing, add dummy symbols for symbols that haven't been seen yet. When the declaration is parsed, those dummies get replaced by their real symbols. After parsing we can check whether any dummies are left and, if so, output an error.

  2. Don't do any symbol handling while parsing; only create the AST. After parsing, step through the AST and add symbols depending on the node type. For a class node, for example, add the symbols of the children first and process them afterwards. For statement blocks, step through the children and add each symbol immediately before the child is processed.

I would expect approach 1 to be easier, and also more useful for things like importing other compilation units.
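To make approach 1 concrete, here is a minimal sketch (in Python, with illustrative names; none of this comes from a real compiler): a use of an unknown name creates a placeholder entry, a declaration fills it in, and any placeholder still unfilled after parsing is an error.

```python
# Sketch of approach 1: placeholder ("dummy") symbols for uses that
# appear before their declaration. All names are illustrative.

class Symbol:
    def __init__(self, name):
        self.name = name
        self.declared = False   # flips to True once the declaration is parsed

class SymbolTable:
    def __init__(self):
        self.symbols = {}

    def use(self, name):
        # On a use: create a placeholder if the name is still unknown.
        if name not in self.symbols:
            self.symbols[name] = Symbol(name)
        return self.symbols[name]

    def declare(self, name):
        # On a declaration: fill in the placeholder (or create the entry).
        sym = self.use(name)
        sym.declared = True
        return sym

    def undeclared(self):
        # After parsing: every symbol that was never declared is an error.
        return [s.name for s in self.symbols.values() if not s.declared]

table = SymbolTable()
table.use("plonk")        # bar() uses plonk before its declaration
table.declare("plonk")    # int plonk; is parsed later
table.use("oops")         # never declared anywhere
print(table.undeclared()) # -> ['oops']
```

Note this sketch has exactly the limitation mentioned in the Edit below the list: it has a single flat table, so it can't distinguish scopes or enforce declare-before-use for locals.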

Edit:
A problem I see with approach 1 is that ordered symbols need some kind of special handling. E.g. within a function it's not possible to use a local symbol before it is declared.


Solution

If you can, just build the AST and the symbol table during the parse. Then make a pass over the AST to associate symbols with symbol table entries. That's essentially your strategy #2.

The problem with strategy #1, in the general case, is that you don't necessarily know that two instances of the same name are bound to the same symbol until you have seen all the declarations. Consider, for example, a language like JavaScript in which the binding domain for a symbol is a function block (a mistake IMHO, but tastes vary) but symbols do not need to be declared before use. In this case, we'll only consider symbols which name functions.

Pseudocode (legal javascript, as it turns out):

function outer() {
  return foo();

  function inner() {
    return foo();

    function foo() {
      return "inner's foo";
    }
  }

  function foo() {
     return "outer's foo";
  }
}

The two uses of foo refer to different symbols, something you can't know until you reach the last definition of foo.

The problem with strategy #2 is that it is not always possible to build an AST without knowing something about the symbols being used. For example, in C you can't really parse an expression like (x)(y) without knowing whether x is a typename or something which can be dereferenced into a function. (Also a mistake, IMHO, but who am I?) In C++, you also need to know whether a given symbol is a template or not. Often this is described as the "kind" of a symbol, as opposed to its "type". In C++, you don't need to know what the "type" of x is to parse (x)(y); you just need to know whether or not it has one. For this reason, C++ allows use of certain symbols before declaration, but not if the declaration is a typedef.

Leaving pathological cases and macro processors aside, it is usually possible to define scopes during the parse, and attach each declaration to a scope. Normally scopes nest in a fairly simple manner, so once you've built the scope tree you can look up any symbol given the current scope node, just by walking up the tree until the symbol is found.
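Such a scope tree can be sketched in a few lines (Python, illustrative names): each scope holds its own declarations and a pointer to its parent, and lookup walks up the chain until it finds the name.

```python
# Sketch of a scope tree with parent-chain lookup. Names are illustrative.

class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}          # name -> declaration info

    def declare(self, name, info):
        self.symbols[name] = info

    def lookup(self, name):
        # Walk up the scope chain until the symbol is found.
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        return None               # undeclared

# Mirroring the JavaScript example: outer and inner each declare a foo.
outer = Scope()
inner = Scope(parent=outer)
outer.declare("foo", "outer's foo")
inner.declare("foo", "inner's foo")

print(inner.lookup("foo"))   # -> inner's foo (shadows outer's)
print(outer.lookup("foo"))   # -> outer's foo
```

Because each use is resolved against the finished tree, the two calls to foo in the JavaScript example correctly bind to different symbols, something a single-pass table could not get right.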

In some languages (like python), declarations are optional and implicit; in such a case, you can attach a new definition to the current scope in a second pass if the symbol is not found.
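That fallback might look like the following sketch (Python, illustrative names): the second pass first tries an ordinary parent-chain lookup, and only if that fails does it attach a fresh definition to the current scope.

```python
# Sketch of implicit declaration in a second pass. Names are illustrative.

class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}

    def find_defining_scope(self, name):
        # Ordinary lookup: walk up until some scope defines the name.
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope
            scope = scope.parent
        return None

def resolve_implicit(scope, name):
    # If no enclosing scope defines the name, implicitly declare it
    # in the current scope (Python-style).
    found = scope.find_defining_scope(name)
    if found is None:
        scope.symbols[name] = "implicit"
        found = scope
    return found

module = Scope()
func = Scope(parent=module)
module.symbols["x"] = "explicit"

print(resolve_implicit(func, "x") is module)  # found in enclosing scope
print(resolve_implicit(func, "y") is func)    # implicitly declared here
```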

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow