Better way to build lists of matches in parser generator

https://stackoverflow.com/questions/17536348

02-06-2022
|

Frage

(I use Yecc, an Erlang parser generator similar to Yacc, so the syntax is different from Yacc)

The problem is simple, say we want parse a lispy syntax, i want match on expressions-lists. An expression list is a list of expressions separated with blank space.

In Erlang, [1,3,4] is a list and ++ concatenates two lists.

We want match this 1 (1+2) 3. expression will match 1, (1+2) and 3. So, there i match on the list followed by one more expression, and if there is no match i end to match on a single expression. This builds a list recursively.

expressionlist -> expressionlist expression : '$1' ++ ['$2'].
expressionlist -> expression : ['$1'].

But i can do this too (invert the order):

expressionlist -> expression expressionlist : ['$1']  ++ '$2'.
expressionlist -> expression : ['$1'].

Both of theese seem to work, i would like to know if there was any difference.

With a separator

I want to match {name = albert , age = 43}. propdef matches name = value. So it is the same problem but with an additional separator ,. Is there any difference there from the first problem ?

proplist -> propdef ',' proplist  :  ['$1'] ++ '$3'.
proplist -> propdef  :  ['$1'].
proplist -> '{' proplist '}' :  '$2'.
proplist -> '{' '}' :  [].

%% Could write this
%% proplist -> proplist ',' propdef :  '$1' ++ ['$3'].

Thank you.

Lösung

Since Yecc is an LALR parser generator, the use of left recursion or right recursion doesn't matter much. In old times, people would prefer left recursion in Yacc/Bison and similar tools because it allows the parser to keep reducing instead of shifting everything onto the stack until the end of the list, but these days, stack space and speed isn't that important, so you can pick whatever suits you best. (Note that in an LL parser, left recursion causes an infinite loop, so for such parsers, right recursion is necessary.)

The more important thing, for your example above, is that '$1' ++ ['$2'] in the left recursive version will cause quadratic time complexity, since the "expressionlist" part is the one that's going to be longer and longer. You should never have the growing component on the left when you use the ++ operator. If you parse a list of thousands of elements, this complexity will hurt you. Using the right recursive version instead, ['$1'] ++ '$2' will give you linear time, even if the parser has to shift the whole list onto the stack before it starts reducing. You could try both versions and parse a really long list to see the difference.

Using a separator as in "propdef ',' proplist" does not change the problem.

Lizenziert unter: CC-BY-SA mit Zuschreibung

Nicht verbunden mit StackOverflow