Question

Everything I've read suggests Treetop backtracks like regular expressions, but I'm having a hard time making that work.

Suppose I have the following grammar:

grammar TestGrammar
  rule open_close
    '{' .+ '}'
  end
end

This does not match the string {abc}. I suspect that's because the .+ is consuming everything from the letter a onwards. I.e. it's consuming abc} when I only want it to consume abc.

This appears different from what a similar regex does. The regex /{.+}/ will match {abc}. It's my understanding that this is possible because the regex engine backtracks after consuming the closing } as part of the .+ and then failing to match.

So can Treetop do backtracking like that? If so, how?

I know you can use negation to match "anything other than a }." But that's not my intention. Suppose I want to be able to match the string {ab}c}. The tokens I want in that case are the opening {, a middle string of ab}c, and the closing }. This is a contrived example, but it becomes very relevant when working with nested expressions like {a b {c d}}.

Was it helpful?

Solution

Treetop is an implementation of a Parsing Expression Grammar parser. One of the benefits of PEGs is their combination of flexibility, speed, and memory requirements. However, this balancing act has some tradeoffs.

Quoting from the Wikipedia article:

The zero-or-more, one-or-more, and optional operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression e, respectively. Unlike in context-free grammars and regular expressions, however, these operators always behave greedily, consuming as much input as possible and never backtracking. […] the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match.

(Emphasis mine.)

In short: while certain PEG operators can backtrack in an attempt to take another route, the + operator cannot.

Instead, in order to match nested sub-expressions, you want to create an alternation between the delimited sub-expression (checked first) followed by the non-expression characters. Something like (untested):

grammar TestGrammar
  rule open_close
    '{' contents '}'
  end
  rule contents
    open_close / non_brackets
  end
  rule non_brackets
    # …
  end
end
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top