What is an Abstract Syntax Tree/Is it needed?

https://stackoverflow.com/questions/11894326

25-06-2021
|

Question

I've been interested in compiler/interpreter design/implementation for as long as I've been programming (only 5 years now) and it's always seemed like the "magic" behind the scenes that nobody really talks about (I know of at least 2 forums for operating system development, but I don't know of any community for compiler/interpreter/language development). Anyways, recently I've decided to start working on my own, in hopes to expand my knowledge of programming as a whole (and hey, it's pretty fun :). So, based off the limited amount of reading material I have, and Wikipedia, I've developed this concept of the components for a compiler/interpreter:

Source code -> Lexical Analysis -> Abstract Syntax Tree -> Syntactic Analysis -> Semantic Analysis -> Code Generation -> Executable Code.

(I know there's more to code generation and executable code, but I haven't gotten that far yet :)

And with that knowledge, I've created a very basic lexer (in Java) to take input from a source file, and output the tokens into another file. A sample input/output would look like this:

Input:

int a := 2
if(a = 3) then
    print "Yay!"
endif

Output (from lexer):

INTEGER
A
ASSIGN
2
IF
L_PAR
A
COMP
3
R_PAR
THEN
PRINT
YAY!
ENDIF

Personally, I think it would be really easy to go from there to syntactic/semantic analysis, and possibly even code generation, which leads me to question: Why use an AST, when it seems that my lexer is doing just as good a job? However, 100% of my sources I use to research this topic all seem adamant that this is a necessary part of any compiler/interpreter. Am I missing the point of what an AST really is (a tree that shows the logical flow of a program)?

TL;DR: Currently in route to develop a compiler, finished the lexer, seems to me like the output would make for easy syntactic analysis/semantic analysis, rather than doing an AST. So why use one? Am I missing the point of one?

Thanks!

Solution

First off, one thing about your list of components does not make sense. Building an AST is (pretty much) the syntactic analysis, so it either shouldn't be in there, or at least come before the AST.

What you got there is a lexer. All it gives you are individual tokens. In any case, you will need an actual parser, because regular languages aren't any fun to program in. You can't even (properly) nest expressions. Heck, you can't even handle operator precedence. A token stream doesn't give you:

An idea where statements and expressions start and end.
An idea how statements are grouped into blocks.
An idea Which part of the expression has which precedence, associativity, etc.
A clear, uncluttered view at the actual structure of the program.
A structure which can be passed through a myriad of transformations, without every single pass knowing and having code to accomodate that the condition in an if is enclosed by parentheses.
... more generally, any kind of comprehension above the level of a single token.

Suppose you have two passes in your compiler which optimize certain kinds of operators applies to certain arguments (say, constant folding and algebraic simplifications like x - x -> 0). If you hand them tokens for the expression x - x * 1, these passes are cluttered with figuring out that the x * 1 part comes first. And they have to know that, lest the transformation is incorrect (consider 1 + 2 * 3).

These things are tricky enough to get right as it is, so you don't want to be pestered by parsing problems as well. That's why you solve the parsing problem first, in a separate parsing step. Then you can, say, replace a function call with its definition, without worrying about adding parenthesis so the meaning remains the same. You save time, you separate concerns, you avoid repetition, you enable simpler code in many other places, etc.

A parser figures all that out, and builds an AST which consequently holds all that information. Without any further data on the nodes, the shape of the AST alone gives you no. 1, 2, 3, and much more, for free. None of the bazillion passes that follow have to worry about it anymore.

That's not to say you always have to have an AST. For sufficiently simple languages, you can do a single-pass compiler. Instead of generating an AST or some other intermediate representation during parsing, you emit code as you go. However, this becomes harder for less simple languages and you can't reasonably do a lot of stuff (such as 70% of all optimizations and diagnostics -- and yes I just made that number up). Generally, I wouldn't advise you to do this. There are good reasons single-pass compilers are mostly dead. Even languages which permit them (e.g. C) are nowadays implemented with multiple passes and ASTs. It's a simple way to get started, but will severely limit you (and the language, if you design it) later.

OTHER TIPS

You've got the AST at the wrong point in your flow diagram. Typically, the output of the lexer is a series of tokens (as you have in your output), and these are fed to the parser/syntactic analyzer, which generates the AST. So the output of your lexer is different from an AST because they are used at different points in the compilation process and fulfill different purposes.

The next logical question is: What, then, is an AST? Well, the purpose of parsing/syntactic analysis is to turn the series of tokens generated by the lexer into an AST (or parse tree). The AST is an intermediate representation that captures the relationship between syntactical elements in a way that is easier to work with programmatically. One way of thinking about this is that a text program is a one dimensional construct, and can only represent ideas as a sequence of elements, while the AST is freed from this constraint, and can represent the underlying relationships between those elements in 2 dimensions (as typically drawn), or any higher dimension space if you so choose to think about it that way.

For instance, a binary operator has two operands, let's call them A and B. In code, this may be spelled 'A * B' (assuming an infix operator - another advantage of an AST is to hide such distinctions that may be important syntactically, but not semantically), but for the compiler to "understand" this expression, it must read 5 characters sequentially, and this logic can quickly become cumbersome, given the many possibilities in even a small language. In an AST representation, however, we have a "binary operator" node whose value is '*', and that node has two children, values 'A' and 'B'.

As your compiler project progresses, I think you will begin to see the advantages of this representation.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow