LL(1) parser implemented with stack: how to build AST?

Question 1

Your stack can be annotated so that it contains the AST entry reference (i.e. rule number + position in rule + target data where to store) + (terminal/non terminal)

Your initial stack <- START_SYMBOL is annotated to store its result in the AST root.

Basically, your pop() selects the current AST construct. Then the next <- next token saves the value in your AST. The stack.push(PARSING_TABLE[top, next]); opens a new AST list and writes it in the construct corresponding to top, and generates in each entry of the stack the 'rule number + position + target list'.

When you parsing is finished, you have the entire tree.

In a precise AST, you might want to ignore some tokens. This can be done via appropriate annotations in the stack set during the push() part. The typical way is to attach to each of your rules (A -> B C) some meta information, for example, what is to be kept and what is the nature of the result.

Question 2

The difficulty arises because the common method of replacing a nonterminal on the stack with the rhs of its matched-rule effectively forgets the grammatical structure at the moment it's predicted. But to generate an AST you need that structure later when a rule-parse is completed.

Rather than replacing a nonterminal with the rhs symbols of its matching rule, leave it in place and push the matched symbols as a list object. This way the stack retains the hierarchial structure of the grammar.

Parsing consumes symbols in the topmost list. The exhaustion of a list corresponds to the completion of a rule. A nonterminal is removed from the stack when its rule is completed, not when it is predicted.

As the stack is consumed, build a corollary AST structure that remembers the relevant rule and stores the parsed tokens. Thus the stack acts like a predicted AST that flows into the parsed AST.

You can think of this as emulating the call hierarchy of a recursive-descent parser with the stack of symbol-lists as a stack of call-frames.

In ruby:

# g is the grammar; a list of rules
# s is a terminal sequence to parse
# note, this code does not tokenize EOF

def parse(g, s)

 tab = gen_table(g)
 stack = [[g.start_sym]]
 # intermediate ast node format: [rule-action, symbols...]
 ast = [[->(rhs){[:_S, rhs[0]]}]]

 loop do
  puts "PARSE\n #{s}\n #{stack}\n #{ast}"

  if stack.first.empty?
   raise "extraneous input" if not s.empty?
   break
  end

  if stack.last.empty? # rule complete
   stack.pop
   node = ast.pop
   # transform the node (eg to a class) using the associated rule-action
   node = node.first.(node.drop(1))
   ast.last.push(node)
   stack.last.shift # rm sym from stack after completing it
   next
  end

  raise "incomplete input" if s.empty?
  tok = s.first
  topsym = stack.last.first

  if topsym.is_a? String # terminal
   raise "mismatch #{tok} != #{topsym}" if tok != topsym
   stack.last.shift
   ast.last.push(s.shift)

  elsif topsym.is_a? Symbol # nonterminal
   ri = tab[topsym][tok]
   raise "no rule for #{topsym}, #{tok}" if ri.nil?
   stack.push(g[ri].rhs.clone)
   ast.push([g[ri].action])
  end

 end

 node = ast.first
 node.first.(node.drop(1))
end