Question

One can compress data with straight-line grammars. An algorithm that employs this technique is called Sequitur. If I understood correctly, Sequitur basically starts with one rule representing the input and then does these three steps till the grammar does not change anymore:

  1. For each rule, try to find any sequences of symbols in any other rule that match the rule's right hand side and replace these sequences by the rules left hand side.
  2. For each pair of adjacent symbols in any right hand side, find all non-overlapping other pairs of adjacent symbols that are equal to the original pair. If there are any other pairs, add a new nonterminal, replace all occurrences of these pairs by the new nonterminal and add a new rule that defines the nonterminal.
  3. For each nonterminal that appears exactly once on all right-hand sides of all rules, replace its occurrence by its definition, remove the nonterminal and the rule that defines it.

For each (non-empty) input, can one guarantee that the above algorithm terminates?

Was it helpful?

Solution

This is my idea for a proof although I am not quite sure whether it works:

Consider $r$, the number of symbols on all right-hand sides and $n$, the number of distinct nonterminals. Each grammar has a specific tuple $(r-n,n)$. Let's define an order on grammars. We first sort by $r-n$ and in case the values are equal by $n$.

Whenever step one fires, $r$ decreases as each rule has at least two symbols on its right hand side. Whenever step two fires, $n$ increases by one and $r$ stays the same (in case only two pairs are found) or decreases (if two or more pairs are found). If the third step fires, $r$ decreases by one and $n$ also decreases by one as one nonterminal is replaced by the whole right hand side of its definition which is removed. Let $G'$ be the tuple of the grammar (whose tuple is $G$) after the application of any of those steps, then:

  • step 1 always reduces $r-n$ by at least one, $G' < G$
  • step 2 always reduces $r-n$ by at least one, $G' < G$
  • step 3 does not change $r-n$ but reduces $n$ by one, $G' < G$

The application of each step reduces the tuple of $G$ to a lower tuple. What is left is to show, that both $r-n$ and $n$ have a lower bound. Per definition $n\ge1$. Since each nonterminal is defined in terms of exactly one rule, $n$ is also the number of rules. Since each rule has at least two symbols on it's right-hand side, $r\ge2n\to r>n\to r-n>0$. The tuples have a lower bound that is eventually reached. Thus the algorithm must eventually terminate.

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange
scroll top