Grammar spec resolving Shift/Reduce conflicts

https://stackoverflow.com//questions/12715961

13-12-2019
|

Question

I'm using Jison (Bison) to create a simple markup language. I'm clearly new to this, but slight variations are working very well. I just don't understand the source of the S/R conflict.

It doesn't seem matter that 'Text' is returned by two lexer actions (with different Start Conditions) and I like this because it seems to allow the grammar to have fewer rules and because the error messages to the user are consistent. I've tried making the 'Text' rule common regardless of context and I've also tried giving each token a different name, but it doesn't seem to have any effect on the S/R Conflicts when it's all together.

The parser is SUPPOSED to create a json-object with plain-text, sub-arrays, and various special nodes.

Specification:

/* lexical grammar */
%lex

%s bracketed

%%

<bracketed>(\\.|[^\\\,\[\]])+       { yytext = yytext.replace(/\\(.)/g, '$1'); return 'Text'; }
<INITIAL>(\\.|[^\\\[])+             { yytext = yytext.replace(/\\(.)/g, '$1'); return 'Text'; }
"["                                 { this.begin('bracketed'); return '['; }
"]"                                 { this.popState(); return ']'; }
","                                 return ','
<<EOF>>                             return 'END'

/lex

%start template

%%    

template
    : sentence END
    ;

sentence
    : /* empty */
    | sentence Text
    | sentence '[' ']'
    | sentence '[' dynamic ']'
    ;

dynamic
    : sentence
    /*| dynamic ',' sentence*/
    ;

Warnings:

Conflict in grammar: multiple actions possible when lookahead token is ] in state 5
- reduce by rule: sentence ->
- shift token (then go to state 6)

States with conflicts:
State 5
  sentence -> sentence [ .] #lookaheads= END Text [ ]
  sentence -> sentence [ .dynamic ] #lookaheads= END Text [ ]
  dynamic -> .sentence #lookaheads= ]
  sentence -> . #lookaheads= ] Text [
  sentence -> .sentence Text
  sentence -> .sentence [ ]
  sentence -> .sentence [ dynamic ]

Different generator algorithms have more or less trouble, but they all seem to have trouble.

Thanks!

Solution

The conflict comes fundamentally from these two rules:

sentence: sentence '[' Text ']'
        | sentence '[' sentenceList ']'

The reason is that after seeing a sentence and a [ and looking at the next token being Text, the parser doesn't known whether to shift the Text, matching the first rule, or to treat that Text as the beginning of a sentenceList going towards matching the second rule.

Now if you had a parser generator that use 2-token lookahead, this wouldn't be a problem, but bison is LALR(1) (the 1 being one token lookahead).

There are a couple of things you could try:

do extra lookahead in the lexer to differentiate Text-followed-by-] from Text-not-followed-by-] as two distinct tokens then rewrite the rules to use both of these tokens.
Use bison's %glr-parser feature to use GLR parser. This will parse the sentence both ways and later throw away the one that doesn't match
refactor the grammar to not need 2-token lookahead.

One refactoring that works in your case would be to rewrite the sentence rules to make them all right-recursive instead of left-recursive:

sentence: /* empty */
        | Text sentence 
        | '[' ']' sentence
        | '[' Text ']' sentence
        | '[' sentenceList ']' sentence
        ;

This avoids having sentence (or any other rule that starts with sentence such as sentenceList) start with a null reduction of the sentence: /*empty*/ rule. So the parser can freely shift a Text in the problematic case deferring the reduction until it sees the next token. It does have memory use implications, however, as it results in a parser that will essentially shift the entire input on to the parser stack and then reduce it one sentence at a time.

Another refactor you could do would be to subsume the [Text] and [] constructs into the [sentenceList]:

sentence: /* empty */
        | sentence Text 
        | sentence '[' sentenceList ']'
        ;

sentenceList: sentence
            | sentenceList ',' sentence

So now a sentenceList is one or more sentences separated by commas (instead of two or more), and in the action for the sentence '[' sentenceList ']' rule, you'd check the sentenceList to see if it was two or more sentences and act appropriately.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow