Question

I've got the Dragon Book, but it doesn't seem to cover this topic...

In most modern languages it's possible to use certain variables even if their declarations and uses appear out of order in the code.

Example

class Foo {
    void bar() {
        plonk = 42;
    }
    int plonk;
}

It doesn't matter that the variable plonk is declared after the function.

Question
Is there any best practice or useful pattern for implementing this? Two approaches came to my mind:

  1. While parsing, add dummy symbols for symbols that haven't been seen yet. When the declaration is parsed, those dummies get replaced by their real symbols. After parsing we can check whether any dummies are left and, if so, output an error.

  2. Don't do any symbol handling while parsing; only create the AST. After parsing, step through the AST and add symbols depending on the node type. For a class node, for example, add the symbols of the children first and process them afterwards. For statement blocks, step through the children and add each symbol immediately before the child is processed.

I would expect approach 1 to be easier, and also more useful for things like importing other compilation units.
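To make approach 1 concrete, here is a minimal sketch (in Python, with illustrative names; none of this comes from a real compiler): a use of an unknown name creates a placeholder entry, a declaration fills it in, and any placeholder still unfilled after parsing is an error.

```python
# Sketch of approach 1: placeholder ("dummy") symbols for uses that
# appear before their declaration. All names are illustrative.

class Symbol:
    def __init__(self, name):
        self.name = name
        self.declared = False   # flips to True once the declaration is parsed

class SymbolTable:
    def __init__(self):
        self.symbols = {}

    def use(self, name):
        # On a use: create a placeholder if the name is still unknown.
        if name not in self.symbols:
            self.symbols[name] = Symbol(name)
        return self.symbols[name]

    def declare(self, name):
        # On a declaration: fill in the placeholder (or create the entry).
        sym = self.use(name)
        sym.declared = True
        return sym

    def undeclared(self):
        # After parsing: every symbol that was never declared is an error.
        return [s.name for s in self.symbols.values() if not s.declared]

table = SymbolTable()
table.use("plonk")        # bar() uses plonk before its declaration
table.declare("plonk")    # int plonk; is parsed later
table.use("oops")         # never declared anywhere
print(table.undeclared()) # -> ['oops']
```

Note this sketch has exactly the limitation mentioned in the Edit below the list: it has a single flat table, so it can't distinguish scopes or enforce declare-before-use for locals.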

Edit:
A problem I see with approach 1 is that ordered symbols need some kind of special handling. E.g. within a function it's not possible to use a local symbol before it is declared.


Solution

If you can, just build the AST and the symbol table during the parse. Then make a pass over the AST to associate symbols with symbol table entries. That's essentially your strategy #2.

The problem with strategy #1, in the general case, is that you don't necessarily know that two instances of the same name are bound to the same symbol until you have seen all the declarations. Consider, for example, a language like JavaScript in which the binding domain for a symbol is a function block (a mistake IMHO, but tastes vary) but symbols do not need to be declared before use. In this case, we'll only consider symbols which name functions.

Pseudocode (legal javascript, as it turns out):

function outer() {
  return foo();

  function inner() {
    return foo();

    function foo() {
      return "inner's foo";
    }
  }

  function foo() {
     return "outer's foo";
  }
}

The two uses of foo refer to different symbols, something you can't know until you reach the last definition of foo.

The problem with strategy #2 is that it is not always possible to build an AST without knowing something about the symbols being used. For example, in C you can't really parse an expression like (x)(y) without knowing whether x is a typename or something which can be dereferenced into a function. (Also a mistake, IMHO, but who am I?) In C++, you also need to know whether a given symbol is a template or not. Often this is described as the "kind" of a symbol, as opposed to its "type". In C++, you don't need to know what the "type" of x is to parse (x)(y); you just need to know whether or not it has one. For this reason, C++ allows use of certain symbols before declaration, but not if the declaration is a typedef.

Leaving pathological cases and macro processors aside, it is usually possible to define scopes during the parse, and attach each declaration to a scope. Normally scopes nest in a fairly simple manner, so once you've built the scope tree you can look up any symbol given the current scope node, just by walking up the tree until the symbol is found.
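Such a scope tree can be sketched in a few lines (Python, illustrative names): each scope holds its own declarations and a pointer to its parent, and lookup walks up the chain until it finds the name.

```python
# Sketch of a scope tree with parent-chain lookup. Names are illustrative.

class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}          # name -> declaration info

    def declare(self, name, info):
        self.symbols[name] = info

    def lookup(self, name):
        # Walk up the scope chain until the symbol is found.
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        return None               # undeclared

# Mirroring the JavaScript example: outer and inner each declare a foo.
outer = Scope()
inner = Scope(parent=outer)
outer.declare("foo", "outer's foo")
inner.declare("foo", "inner's foo")

print(inner.lookup("foo"))   # -> inner's foo (shadows outer's)
print(outer.lookup("foo"))   # -> outer's foo
```

Because each use is resolved against the finished tree, the two calls to foo in the JavaScript example correctly bind to different symbols, something a single-pass table could not get right.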

In some languages (like python), declarations are optional and implicit; in such a case, you can attach a new definition to the current scope in a second pass if the symbol is not found.
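That fallback might look like the following sketch (Python, illustrative names): the second pass first tries an ordinary parent-chain lookup, and only if that fails does it attach a fresh definition to the current scope.

```python
# Sketch of implicit declaration in a second pass. Names are illustrative.

class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}

    def find_defining_scope(self, name):
        # Ordinary lookup: walk up until some scope defines the name.
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope
            scope = scope.parent
        return None

def resolve_implicit(scope, name):
    # If no enclosing scope defines the name, implicitly declare it
    # in the current scope (Python-style).
    found = scope.find_defining_scope(name)
    if found is None:
        scope.symbols[name] = "implicit"
        found = scope
    return found

module = Scope()
func = Scope(parent=module)
module.symbols["x"] = "explicit"

print(resolve_implicit(func, "x") is module)  # found in enclosing scope
print(resolve_implicit(func, "y") is func)    # implicitly declared here
```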

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow