Pergunta

I have a grammar with rules like this

A -> pq
B -> pr

Block -> { Astar Bstar }

Astar -> Astar A
       | epsilon

Bstar -> Bstar B
       | epsilon

Is there any way to turn this grammar into LALR(1)? From what I can make out, if the parser sees p inside a block, there's a shift/educe conflict.

Foi útil?

Solução

Your language is regular, and is equivalent to the regular expression \{(pq)*(pr)*\}. The problem is that any simple conversion to a grammar is going to require two-character lookahead to see whether there's a q or r after the p

Now you have this tagged both yacc and parsing so its not clear whether your looking for a practical answer of "how do you deal with this with yacc" or a theoretical answer "is there an LALR(1) grammar for this language."

For the practical answer, the solution is that you punt -- you move the recognition for A and B into the lexer where you recognize the character sequences pq and pr as the tokens A and B. Since lex/flex uses a DFA and backtracks to the longest matched token, they have no problem with arbitrary lookahead here.

For the theoretical answer, you need to transform the grammar to remove the need for lookahead. The problematic construct is (pq)*(pr)* (the braces are irrelevant) which is equivalent to p(qp)*(rp)*r | p(qp)*q | epsilon which suggests a grammar like:

Block -> { p Astar Bstar r }
       | { p Astar q }
       | { }
A -> q p
B -> r p
Astar -> Astar A | epsilon
Bstar -> Bstar B | epsilon

An alternate approach is unfactoring the star rules to make them not match empty strings:

Block -> { Aplus Bplus }
       | { Aplus }
       | { Bplus }
       | { }
A -> p q
B -> p r
Aplus -> Aplus A | A
Bplus -> Bplus B | B

giving you two very different LALR(1) grammars for the same language.

Outras dicas

Let's start off by seeing exactly why you're getting a LALR(1) conflict, then see if we can rework the grammar to make it LALR(1).

To see why the grammar isn't LALR(1), let's begin by computing the LR(1) configurating sets for the grammar:

(1)
S'     -> .Block [$]
Block  -> .{ Astar Bstar } [$]

(2)
S'     -> Block. [$]

(3)
Block -> { . Astar Bstar } [$]
Astar -> .Astar A [ }p ]
Astar -> .epsilon [ }p ]

(4)
Block -> { Astar . Bstar } [$]
Astar -> Astar .A [ }p]
A     -> .pq [}p]
Bstar -> .epsilon [ }p ]
Bstar -> . Bstar B [ }p ]

At this point, we can stop because we have a shift/reduce confict in state (4) on the symbol p: do you shift the p for A -> .pq [ {p ], or do you reduce BStar -> .epsilon [ }p ]? Since there's a shift/reduce conflict in the LR(1) grammar, the grammar is not LR(1) at all, meaning that it can't possibly be LALR(1) (because every LALR(1) grammar is also an LR(1) grammar).

Fundamentally, the issue is that when the parser sees a p, it can't tell whether it's looking at the start of an A (meaning that it needs to shift it) or if there are no more A's left and it's looking at the start of a B (meaning that it needs to reduce Bstar -> epsilon).

To fix this, let's see what happens if we make a small tweak. The issue we encountered is that the parser needs to immediately determine upon seeing a p whether to shift or reduce. What if we give it time to delay the decision by looking at the p and then the follow-up character? To do this, let's change your grammar slightly by rewriting

Bstar -> epsilon
Bstar -> Bstar B

as

Bstar -> epsilon
Bstar -> B Bstar

Now, the parser gets to look at more tokens before deciding what to do. If it's looking at pq, it knows that it isn't looking at anything B-related. If it sees pr, it knows it's looking at a B, and can therefore start doing productions of the second sort. Let's see what happens to our LR(1) states if we do this:

(1)
S'     -> .Block [$]
Block  -> .{ Astar Bstar } [$]

(2)
S'     -> Block. [$]

(3)
Block -> { . Astar Bstar } [$]
Astar -> .Astar A [ }p ]
Astar -> .epsilon [ }p ]

(4)
Block -> { Astar . Bstar } [$]
Astar -> Astar .A [ }p]
A     -> .pq [}p]
Bstar -> .epsilon [ } ]
Bstar -> . B Bstar [ } ]
B     -> .pr [}]

(5)
Block -> { Astar Bstar . } [$]

(6)
Block -> { Astar Bstar } . [$]

(7)
A     -> p.q [}p]
B     -> p.r [}]

(8)
A     -> .pq [}p]

(9)
B     -> pr. [}]

(10)
Bstar -> B . Bstar [ } ]
Bstar -> . B Bstar [ } ]
B     -> .pr [}]

(11)
B     -> p.r [}]

Notice that our original shift/reduce conflict has disappeared, and this new grammar no longer has any shift/reduce conflicts at all. Moreover, since there aren't any pairs of states with the same core, the above set of states is also the set of states we would have in our LALR(1) table. Therefore, the above grammar is indeed LALR(1), and we haven't changed the meaning of the grammar at all.

Hope this helps!

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top