Question

I'm experimenting with parsing on my free time, and I wanted to implement a shift-reduce parser for a very very simple grammar. I've read many online articles but I'm still confused on how to create parse trees. Here's an example of what I want to do:


Grammar:

Expr -> Expr TKN_Op Expr 
Expr -> TKN_Num

Here's an example input:

1 + 1 + 1 + 1

That, after tokenization, becomes:

TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num

I understand that:

  1. Shifting means pushing the first input token on the stack and removing it from the input
  2. Reducing means substituting one or more elements on the stack with a grammar element

So, basically, this should happen:

Step 1:
    Stack:
    Input: TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num
    What: Stack is empty. Shift.

Step 2:
    Stack: TKN_Num
    Input: TKN_Op TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num
    What: TKN_Num can be reduced to Expr. Reduce.

Step 3:
    Stack: Expr
    Input: TKN_Op TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num
    What: Cannot reduce. Shift.

Step 4:
    Stack: Expr TKN_Op 
    Input: TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num
    What: Cannot reduce. Shift.

Step 5:
    Stack: Expr TKN_Op TKN_Num
    Input: TKN_Op TKN_Num TKN_Op TKN_Num
    What: TKN_Num can be reduced to Expr. Reduce.
    // What should I check for reduction? 
    // Should I try to reduce incrementally using 
    // only the top of the stack first, 
    // then adding more stack elements if I couldn't
    // reduce the top alone?

Step 6:
    Stack: Expr TKN_Op Expr
    Input: TKN_Op TKN_Num TKN_Op TKN_Num
    What: Expr TKN_Op Expr can be reduced to Expr. Reduce.

Step 7:
    Stack: Expr
    Input: TKN_Op TKN_Num TKN_Op TKN_Num
    What: ...
    // And so on...

Apart from the "what to reduce?" doubt, I have no clue how to correctly build a parse tree. The tree should probably look like this:

1    +    o
          |
     1    +    o
               |
          1    +    1

Should I create a new node on reduction?

And when should I add children to the newly create node / when should I create a new root node?

Was it helpful?

Solution

The simple and obvious thing to do is to create a tree node on every reduction, and add the tree nodes from the grammar elements that were reduced to that tree node.

This is easily managed with a node stack that runs in parallel to the "shift token" stack that the raw parser uses. For every reduction for a rule of length N, the shift-token stack is shortened by N, and nonterminal token is pushed on the shift stack. At the same time, shorten the node stack by removing the top N nodes, create a node for the nonterminal, attached the removed N nodes as children, and push that node onto the node stack.

This policy even works with rules that have zero-length right hand side: create a tree node and attach the empty set of children to it (e.g., create a leaf node).

If you think of a "shift" on a terminal node as a reduction (of the characters forming the terminal) to the terminal node, then terminal node shifts fit right in. Create a node for the terminal, and push it onto the stack.

If you do this, you get a "concrete syntax/parse tree" that matches the grammar isomorphically. (We do this for a commercial tool I offer). There are lots of folks that don't like such concrete trees, because they contain nodes for keywords, etc., which don't really add much value. True, but such trees are supremely easy to construct, and supremely easy to understand becuase the grammar is the tree structure. When you have 2500 rules (as we do for a full COBOL parser), this matters. This is also convenient because all the mechanism can be built completely into the parsing infrastructure. The grammar engineer simply writes rules, the parser runs, voila, a tree. It is also easy to change the grammar: just change it, voila, you still get parse trees.

However, if you don't want a concrete tree, e.g., you want an "abstract syntax tree", then what you have to do is let the grammar engineer control which reductions generate nodes; usually be adding some procedural attachment (code) to each grammar rule to be executed on a reduction step. Then if any such procedural attachement produces a node, it is retained on a node stack. Any procedural attachment which produces a node must attach nodes produced by the right hand elements. If any this is what YACC/Bison/... most of the shift-reduce parser engines do. Go read about Yacc or Bison and examine a grammar. This scheme gives you lot of control, at the price of insisting that you take that control. (For what we do, we don't want this much engineering effort in building a grammar).

In the case of producing CSTs, it is conceptually straightforward to remove "useless" nodes from trees; we do that in our tool. The result is a lot like an AST, without the manual effort to write all those procedural attachments.

OTHER TIPS

The reason of your trouble is that you have a shift/reduce conflict in your grammar:

expr: expr OP expr
    | number

You can resolve this in 2 ways:

expr: expr OP number
    | number

for left associative operators, or

expr: number OP expr
    | number

for right associative ones. This should also determine the shape of your tree.

Reduction is usually done when one clause is detected complete. In the right associative case, you would start in a state 1 that expects a number, pushes it onto the value stack and shifts to state 2. In state 2, if the token is not an OP, you can reduce a number to an expr. Otherwise, push the operator and shift to state 1. Once state 1 is complete, you can reduce the number, operator and expression to another expression. Note, you need a mechanism to "return" after a reduction. The overall parser would then start in state 0, say, which immediately goes to state 1 and accepts after reduction.

Note that tools like yacc or bison make this kind of stuff much easier because they bring all the low level machinery and the stacks.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top