Question

A bit of context

I was writing a parser for a grammar, and for testing purposes I came up with the idea of generating some random inputs. The grammar I was actually dealing with was much more complicated; in this question I present a "minimal working example" for simplicity. Of course, I am able to avoid the issue I faced using a bunch of trivial heuristics, but the question really makes me wonder.

The problem

Suppose we have a typical context-free grammar for arithmetic expressions of $+,*$, brackets, and integer literals:

$$E \longrightarrow F(“+”F)^*$$ $$F \longrightarrow T(“*”T)^*$$ $$T \longrightarrow int|“(”E“)”$$

It is easy to implement a straightforward algorithm for generating random words of this grammar: we implement a separate procedure for each nonterminal. If a nonterminal has multiple production rules (as $T$ does), we choose a production rule by tossing a coin. If a rule contains a Kleene star (e.g. $(“+”F)^*$), we also toss a coin and generate zero or one repetition (sure, we could pick any random integer $k\geq0$ and generate $k$ repetitions, but for simplicity we will focus on the simplest version of this procedure). Here is what we get:

def generate_E():
    if coin_toss():
        return generate_F() + "+" + generate_F()
    else:
        return generate_F()

def generate_F():
    if coin_toss():
        return generate_T() + "*" + generate_T()
    else:
        return generate_T()

def generate_T():
    if coin_toss():
        return "(" + generate_E() + ")"
    else:
        return random_integer()

An invocation of generate_E() yields a random expression.

What could go wrong? It turns out that executing this procedure on a real machine quite often ends in a stack overflow. Of course, technically we have the possibility of endless recursion here, but my intuition told me that the probability of reaching recursion depth $k$ decays exponentially in $k$, so reaching deep levels (say, 1000) should be almost impossible. However, a few consecutive runs reveal that the procedure can easily reach a depth of many thousands (by depth I mean the maximum number of procedure calls the stack contains simultaneously).
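This is easy to reproduce by measuring the depth directly. Below is a quick instrumentation sketch (the helper names `depth_E`, `depth_F`, `depth_T`, `deep_fraction` are mine): each procedure returns the deepest call level it reaches instead of the generated string, and runs that blow past the interpreter's stack limit are counted as "deep".

```python
import random
import sys

def coin():
    return random.random() < 0.5

# Depth-instrumented variants of the three procedures: each returns the
# maximum number of nested generate-calls instead of the generated string.
def depth_E():
    return 1 + (max(depth_F(), depth_F()) if coin() else depth_F())

def depth_F():
    return 1 + (max(depth_T(), depth_T()) if coin() else depth_T())

def depth_T():
    return 1 + (depth_E() if coin() else 0)

def deep_fraction(runs, limit):
    """Fraction of runs whose recursion depth exceeds `limit`."""
    sys.setrecursionlimit(5 * limit)    # generous headroom above `limit`
    deep = 0
    for _ in range(runs):
        try:
            if depth_E() > limit:
                deep += 1
        except RecursionError:          # overflowed even the raised limit,
            deep += 1                   # so certainly deeper than `limit`
    return deep / runs
```

On my runs, a substantial fraction of invocations exceeds depth 1000, which makes the frequent stack overflows easy to reproduce.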

I am curious how to formalize this empirical fact. I would like either a formula for $P(depth = k)$, or an asymptotic approximation of it, or a lower bound on the right tail of the distribution (something like $P(depth > 1000) > 0.05$).

My attempt

I tried to come up with a formula for $P(depth = k)$:

Let's denote by $P_E(k)$ the probability that generate_E() reaches depth exactly $k$. We also define the analogous quantities $P_F(k)$ and $P_T(k)$ for generate_F() and generate_T() respectively.

Clearly (the else branch of generate_T), $$P_T(1) = \frac{1}{2},$$ and for $k > 1$ (the then branch) $$P_T(k) = \frac{1}{2}P_E(k - 1).$$

Regarding $P_F(k)$: the else branch contributes the term $$\frac{1}{2}P_T(k - 1),$$ while the then branch contributes the slightly more complicated $$\frac{1}{2}\sum_{(x, y) \,:\, \max(x, y) = k - 1}{P_T(x)P_T(y)},$$ i.e. $$P_F(k)= \frac{1}{2}\Big(P_T(k - 1) + \sum_{(x, y) \,:\, \max(x, y) = k - 1}{P_T(x)P_T(y)}\Big).$$

Finally, the formula for $P_E(k)$ is almost the same as for $P_F(k)$; we only have to replace $P_T(x)$ with $P_F(x)$.

Now we can calculate some values of $P_E(k)$:

\begin{array} {|r|r|r|r|}\hline k & P_E(k) & P_E(k)\text{ in decimal}& P_E(k)\text{ by Monte-Carlo} \\ \hline 3 & \frac{33}{128} & \approx0.257812 & \approx0.2527 \\ \hline 6 & \frac{4448787585}{34359738368} & \approx0.129477 & \approx0.1282 \\ \hline 9 & \frac{14080391757747154038821618051813151497122305}{178405961588244985132285746181186892047843328} & \approx0.078923 & \approx0.0761 \\ \hline 12 & \text{ the fraction is too long} & \approx0.053213 & \approx0.0530 \\ \hline \end{array}

As we can see, the recurrent formulas appear to be correct, but they neither give me insight into the asymptotic behaviour of $P_E(k)$, nor do the first values suggest a closed form.
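For reference, the recurrences can be evaluated exactly with rational arithmetic. The sketch below (names are mine) uses the cumulative sums $C_T(m)=\sum_{x\le m}P_T(x)$, so that $\sum_{\max(x,y)=k-1}P_T(x)P_T(y) = C_T(k-1)^2 - C_T(k-2)^2$, and likewise for $F$; it reproduces the table above.

```python
from fractions import Fraction

def depth_distribution(n):
    """Exact P_E(k) for k = 1..n; index 0 of the returned list is unused."""
    half = Fraction(1, 2)
    PT, PF, PE = ([Fraction(0)] * (n + 1) for _ in range(3))
    CT, CF = ([Fraction(0)] * (n + 1) for _ in range(2))  # cumulative sums
    for k in range(1, n + 1):
        PT[k] = half if k == 1 else half * PE[k - 1]
        if k >= 2:
            # else branch: one T at depth k-1; then branch: max of two
            # i.i.d. T-depths equals k-1, with prob CT[k-1]^2 - CT[k-2]^2
            PF[k] = half * (PT[k - 1] + CT[k - 1] ** 2 - CT[k - 2] ** 2)
            # same shape with F in place of T
            PE[k] = half * (PF[k - 1] + CF[k - 1] ** 2 - CF[k - 2] ** 2)
        CT[k] = CT[k - 1] + PT[k]
        CF[k] = CF[k - 1] + PF[k]
    return PE
```

For example, `depth_distribution(6)[3]` evaluates to $\frac{33}{128}$, matching the first row of the table.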

Any help would be appreciated.


Solution

Your process is a textbook example of a branching process. Starting with one $E$, we have an expected $3/2$ many $F$s, hence $9/4$ many $T$s, and so $9/8$ many new $E$s in expectation. Since $9/8 > 1$, it is not surprising that your process often failed to terminate.

To gain more information, we need to know the exact distribution of the number of $E$-offspring, which is given by the following generating function (see the Wikipedia article on branching processes): $$ h(z) = \frac{33}{128} + \frac{7}{16}z+ \frac{15}{64}z^2 + \frac{1}{16}z^3 + \frac{1}{128}z^4. $$ This implies that the extinction probability is $d \approx 0.717778143742483$ (obtained by solving $h(z) = z$). This is an upper bound on the probability that your procedure terminates.
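A short numerical check (a sketch, with names of my choosing): since $h$ is increasing and convex on $[0,1]$, iterating $h$ from $0$ converges monotonically to the smallest fixed point, which is exactly the extinction probability $d$; plugging $d$ into $h'$ gives the decay rate used in the next step.

```python
def h(z):
    # offspring generating function for the number of E-offspring of one E
    return 33/128 + 7/16*z + 15/64*z**2 + 1/16*z**3 + 1/128*z**4

def h_prime(z):
    # derivative of h
    return 7/16 + 15/32*z + 3/16*z**2 + 1/32*z**3

# the extinction probability d is the smallest fixed point of h in [0, 1];
# iterating h from 0 converges to it
d = 0.0
for _ in range(2000):
    d = h(d)

print(d)           # ≈ 0.717778143742483
print(h_prime(d))  # ≈ 0.882115879977412
```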

We can easily recover your numbers by considering iterates of $h$. The probability that the process dies out within $k$ generations, i.e. that the procedure terminates within depth $3k$, is $h^{(k)}(0)$. So computing $h(0), h(h(0))-h(0), h(h(h(0)))-h(h(0))$, and so on, we recover the figures in your table.

When $k$ is large, $h^{(k)}(0) \approx d$. We can compute $h'(d) \approx 0.882115879977412$. We have $$ \frac{d - h^{(k)}(0)}{d - h^{(k-1)}(0)} = \frac{h(d) - h(h^{(k-1)}(0))}{d - h^{(k-1)}(0)} \approx h'(d). $$ It follows that $$ d - h^{(k)}(0) \propto h'(d)^k. $$ Therefore the probability that the process terminates in exactly $3k$ steps is $$ h^{(k)}(0) - h^{(k-1)}(0) = [d - h^{(k-1)}(0)] - [d - h^{(k)}(0)] \propto h'(d)^{k-1} - h'(d)^k \propto h'(d)^k. $$ Empirically, we can check that the constant of proportionality is roughly $0.0248011196615094$.
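Both observations are easy to reproduce numerically (sketch, my variable names): consecutive differences of the iterates $h^{(k)}(0)$ give the termination probabilities, and the ratio of successive differences converges to $h'(d)$.

```python
def h(z):
    return 33/128 + 7/16*z + 15/64*z**2 + 1/16*z**3 + 1/128*z**4

# h^{(k)}(0) = P(the branching process dies out within k generations)
iterates = [0.0]
for _ in range(80):
    iterates.append(h(iterates[-1]))

# consecutive differences: probs[k-1] = P(process dies at generation k),
# i.e. the procedure terminates at depth 3k; the first entries match the table
probs = [b - a for a, b in zip(iterates, iterates[1:])]
print(probs[:4])  # ≈ [0.2578, 0.1295, 0.0789, 0.0532]

# deep in the tail the decay ratio approaches h'(d) ≈ 0.8821
print(probs[60] / probs[59])
```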

OTHER TIPS

As Yuval has noted, this way of randomly generating recursive data structures is known to (usually) end up with an infinite expected size.

There is, however, a solution to the problem that allows one to weight the recursive choices in such a way that the expected size lies within a certain finite interval: Boltzmann samplers. They are based on the combinatorial generating function of the structure and come from the theory of combinatorial species. For practical implementations you don't need the theoretical parts, though. A good programmatic introduction in Haskell can be found in Brent Yorgey's blog. If you can read (or decipher) Haskell, porting the approach to your own data structure isn't too difficult.
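As a minimal illustration of the weighting idea in Python (not a full Boltzmann sampler; the constant here is chosen by hand rather than derived from a generating function): biasing only the coin in generate_T so that it recurses with probability $p$ gives an expected $\frac32\cdot\frac32\cdot p = \frac{9p}{4}$ $E$-offspring per $E$, so any $p < 4/9$ makes the branching process subcritical and the expected output size finite.

```python
import random

# Recursing in T with probability P_RECURSE gives 9*P_RECURSE/4 expected
# E-offspring per E; any value below 4/9 makes the process subcritical,
# so generation terminates almost surely with finite expected size.
P_RECURSE = 0.3

def generate_E():
    return generate_F() + ("+" + generate_F() if random.random() < 0.5 else "")

def generate_F():
    return generate_T() + ("*" + generate_T() if random.random() < 0.5 else "")

def generate_T():
    if random.random() < P_RECURSE:
        return "(" + generate_E() + ")"
    return str(random.randint(0, 9))
```

Tuning `P_RECURSE` closer to $4/9$ yields larger expressions on average while keeping the expected size finite; a genuine Boltzmann sampler derives such weights systematically.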

As an example for how a derivation like yours looks within that framework, read Counting and generating terms in the binary lambda calculus by Grygiel & Lescanne (spoiler: it's a surprising amount of complex analysis).

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange