Left-Factoring a grammar into LL(1)

https://cs.stackexchange.com/questions/4862

16-10-2019
|

Question

I have a homework assignment where I need to convert a grammar into LL(1). I've already removed the left recursion, but I'm having trouble doing left-factoring. All of the examples I've found are simple, and look something like this:

A -> aX | aY
becomes:
A -> aZ
Z -> X | Y

I understand that. However, my grammar looks more like this:

X -> aE | IXE | (X)E
E -> IE | BXE | ϵ
I -> ++ | --
B -> + | - | ϵ

I'm not sure how to apply the simpler example to this. I've been trying for at least a couple of hours and I've lost track of all of the things I've tried. Generally, my attempts have looked something like this:

X  -> X' | IXE
X' -> aE | (X)E
E  -> IE | BIX'E | BX'E | ϵ

And I then try to convert the E rules into ones having only one production starting with + or -:

X  -> X' | IXE
X' -> aE | (X)E
B' -> + | -
E  -> IE | B'IX'E | IX'E | B'X'E | X'E | ϵ

And then...

X  -> X' | IXE
X' -> aE | (X)E
B' -> + | -
E  -> +P | -M | ϵ
P  -> +E | IX'E | +X'E | X'E
M  -> -E | IX'E | -X'E | X'E

And so on. But I continually end up with a lot of extra nonterminals, and some very long productions / chains of productions, without actually having left-factored it. I'm not sure how to approach this - I can't seem to eliminate some nonterminal having multiple productions starting with a + and with a -.

Solution

Let's have a look at your grammar:

$\qquad \begin{align} X &\to aE \mid IXE \mid (X)E \\ E &\to IE \mid BXE \mid \varepsilon \\ I &\to \text{++} \mid \text{--} \\ B &\to \text{+} \mid \text{-} \mid \varepsilon \end{align}$

Note that $X$ does not need left-factoring: all rules have disjoint FIRST sets¹. If you want to make this obvious, you can drop $I$ and inline it:

$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{++}E \mid \text{--}E \mid BXE \mid \varepsilon \\ B &\to \text{+} \mid \text{-} \mid \varepsilon \end{align}$

Similarly, we can inline $B$:

$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{++}E \mid \text{--}E \mid \text{+}XE \mid \text{-}XE \mid XE \mid \varepsilon \end{align}$

Now we see that we actually have to do left-factoring on $E$: we have obvious conflicts, and we get additional conflicts via $XE$. So, let's inline $X$ once at $XE$:

$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{++}E \mid \text{--}E \mid \text{+}XE \mid \text{-}XE \mid aEE \mid \text{++}XEE \mid \text{--}XEE \mid (X)EE \mid \varepsilon \end{align}$

And now we can left-factor as easily as in your example:

$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{+}P \mid \text{-}M \mid aEE \mid (X)EE \mid \varepsilon \\ P &\to \text{+}E \mid XE \mid \text{+}XEE \\ M &\to \text{-}E \mid XE \mid \text{-}XEE \end{align}$

By now we can see that we are not getting anywhere: by factoring away $\text{+}$ or $\text{-}$ from the alternatives, we dig up another $X$ which again has both $\text{+}$ and $\text{-}$ in its FIRST set.

So let's have a look at your language. Via

$\qquad \displaystyle X \Rightarrow aE \Rightarrow^* aI^n E \Rightarrow aI^nBXE$

and

$\qquad \displaystyle X \Rightarrow aE \Rightarrow^* aI^n E \Rightarrow aI^nIE$

you have arbitrarily long prefixes of the form $+^+$ which end differently, semantic-wise: an LL(1) parser can not decide whether any given (next) $\text{+}$ belongs to a pair -- which would mean choosing alternative $IE$ -- or comes alone -- which would mean choosing $BXE$.

In consequence, it doesn't look like your language can be expressed by any LL(1) grammar, so trying to convert yours into one is futile.

It's even worse: as $BXE \Rightarrow BIXEE \Rightarrow^* BI^n X E^n E$, you can not decide to chose $BXE$ with any finite look-ahead. This is not a formal proof, but it strongly suggests that your language is not even LL.

If you think about what you are doing -- mixing Polish notation with unary operators -- it is not very surprising that parsing should be hard. Basically, you have to count from the left and from the right to identify even a single $B$-$\text{+}$ in a long chain of $\text{+}$. If I think of multiple $B$-$\text{+}$ in a chain, I'm not even sure the language (with two semantically different but syntactically equal $\text{+}$) can be parsed deterministically (without backtracking) at all.

That would be the sets of terminals that can come first in derivations of a non-terminal/rule alternative.

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange