Given a string and a CFG, what characters can follow the string (in the sentential forms of the CFG)?

https://cs.stackexchange.com/questions/555

16-10-2019
|

Question

Let $\Sigma$ be the set of terminal and $N$ the set of non-terminal symbols of some context-free grammar $G$.

Say I have a string $a \in (\Sigma \cup N)^+$ such that $x a y \in \mathcal{S}(G)$ where $x,y\in (\Sigma \cup N)^*$ and $\mathcal{S}(G)$ are the sentential forms of $G$.

Given $G$, I'd like to determine a set $C = \{ b \mid wabz \in \mathcal{S}(G), b \in \Sigma \cup N \}$.

To clarify, in this case, $w, x, y, z, a, b$ are strings of terminals and non-terminals and $b$ is of length one.

I can see how to do this if $a$ is also of length one; each $b$ is a member of the follow set of $a$ (including non-terminals).

However, I am curious if it's possible for a sequence of characters. For my application, the string $a$ is not much longer than the right hand side of the productions in $G$.

The distinction between terminals and non-terminals is somewhat mute in my application because I am using a generative grammar; and I believe that this won't lead to much trouble since $b$ is of length one.

Solution

I'll describe an algorithm that works. It's running time shouldn't be too bad. You can precompute quite a bit of this as well.

I'll assume that $a$ does not contain nonterminals (though it's probably easy to adapt to that case) and that you don't know $x$, $y$ or the derivation of $a$. I'll also assume your grammar doesn't contain productions that are never used in any derivation ($A \rightarrow A$ for example).

The main issue really is to parse $a$, as you want to know what kind of states you end up in, so you know what can follow $a$. This is not so easy as you don't know $x$.

We use an adaptation of Earley's algorithm. You'll want to understand that algorithm first. Our algorithm works in almost the same way, except our initialization and completion steps are different.

For the initialization, we seed our first set with an Earley item for every occurrence of $a_1$ (the first character in $a$) in any production of your grammar. We set the back pointer of this item to -1, an invalid value. This is important in our modified completion. Essentially, the -1 means 'I have no idea where this production was started'.

Now, we perform the Earley algorithm separately for every possible such initial Earley item. We can't simply do all of them at the same time, as the parses may interfere with each other. I can't easily see a faster method than backtracking here.

For the completion step, we only have to make a modification to handle -1 back pointers. As we've completed a production whose origin we don't know, we are in trouble. However, the method used to compute $LALR(1)$ lookahead sets by Pennello and DeRemer saves us: what we need here is exactly the $LALR(1)$ lookahead sets. Every item in these lookahead sets has a corresponding position in the grammar, which in turn corresponds to a possible continuation of the completed production.

Unfortunately, I don't really see any other choice than to backtrack once again here. For every position in the lookahead set, you perform the completion step with this position, and continue the parse from there. You do this separately for every parse. Note that if your grammar is $LALR(1)$, your lookahead will uniquely determine which position you have to go to, so you don't have to backtrack.

You continue the above algorithm one character beyond $a$, where you consider this extra, virtual character to be the 'any character', which immediately gives you the 'follow' set you are looking for - any time the scanner phase finds something for this final set, you can add this character to your answer set.

Edit: I think I've found the method that removes most of the overhead introduced by the backtracking. We associate with every Earley item a set of identifiers, which are strings, as we will need to use prefixes of these identifiers. On initialization, we add all initial items to the Earley set, and associate a unique identifier with every set.

On the scanner and predictor steps, identifiers are carried over to new items. Earley items in the same Earley set that only differ in their identifiers are merged together by merging their identifiers together. Note that we can perform scanner and predictor steps on these new items with identifiers, without having to perform this step for every identifier separately.

The completer considers the identifiers separately, and only completes an item if the corresponding item in the earlier itemset has an identifier that is a prefix of the identifier. For every possible completion (so for every item in the $LALR(1)$ lookahead set), we append a unique character to the identifiers of the completed items.

Essentially, we do the backtracking using these identifiers, so that we don't do double work in the scanner and predictor steps.

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange