Question

We learned about the class of regular languages $\mathrm{REG}$. It is characterised equivalently by regular expressions, finite automata and left-linear grammars, so it is easy to show that a given language is regular.

How do I show the opposite, though? My TA has been adamant that in order to do so, we would have to show for all regular expressions (or for all finite automata, or for all left-linear grammars) that they cannot describe the language at hand. This seems like a big task!

I have read about some pumping lemma but it looks really complicated.

This is intended to be a reference question collecting usual proof methods and application examples. See here for the same question on context-free languages.


Solution

Proof by contradiction is often used to show that a language is not regular: let $P$ be a property that holds for all regular languages; if your specific language does not satisfy $P$, then it is not regular. The following properties can be used:

  1. The pumping lemma, as exemplified in Dave's answer;
  2. Closure properties of regular languages (set operations, concatenation, Kleene star, mirror, homomorphisms);
  3. The Myhill–Nerode theorem: a regular language has only a finite number of prefix equivalence classes.

To prove that a language $L$ is not regular using closure properties, the technique is to combine $L$ with regular languages by operations that preserve regularity, in order to obtain a language known not to be regular, e.g., the archetypal non-regular language $I = \{ a^n b^n \mid n \in \mathbb{N} \}$. For instance, let $L = \{a^p b^q \mid p \neq q \}$. Assume $L$ is regular. Since regular languages are closed under complementation, $L$'s complement $L^c$ is regular as well. Intersecting $L^c$ with the regular language $a^\star b^\star$ yields $I$, which is not regular, a contradiction.
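As a quick sanity check (not a proof), the identity $L^c \cap a^\star b^\star = I$ can be verified by brute force on short strings; here is a minimal Python sketch with ad-hoc membership tests (the helper names are mine):

```python
from itertools import product

# L = { a^p b^q : p != q };  I = { a^n b^n : n >= 0 }
def in_L(w):
    p = len(w) - len(w.lstrip("a"))          # number of leading a's
    return w == "a" * p + "b" * (len(w) - p) and p != len(w) - p

def in_I(w):
    return len(w) % 2 == 0 and w == "a" * (len(w) // 2) + "b" * (len(w) // 2)

def in_a_star_b_star(w):
    p = len(w) - len(w.lstrip("a"))
    return set(w[p:]) <= {"b"}

# check: (complement of L) intersected with a*b*  ==  I, on all words up to length 8
for n in range(9):
    for w in ("".join(t) for t in product("ab", repeat=n)):
        assert ((not in_L(w)) and in_a_star_b_star(w)) == in_I(w)
print("identity holds on all words of length <= 8")
```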

The Myhill–Nerode theorem can be used to prove that $I$ is not regular. For $p \geq 0$, $I/a^p = \{ a^{r}b^rb^p \mid r \in \mathbb{N} \} = I.\{b^p\}$. All these classes are different, and there are infinitely many of them. As a regular language must have only finitely many classes, $I$ is not regular.

OTHER TIPS

Based on Dave's answer, here is a step-by-step "manual" for using the pumping lemma.

Recall the pumping lemma (taken from Dave's answer, which took it from Wikipedia):

Let $L$ be a regular language. Then there exists an integer $n\ge 1$ (depending only on $L$) such that every string $w$ in $L$ of length at least $n$ ($n$ is called the "pumping length") can be written as $w = xyz$ (i.e., $w$ can be divided into three substrings), satisfying the following conditions:

  1. $|y| \ge 1$
  2. $|xy| \le n$ and
  3. a "pumped" $w$ is still in $L$: for all $i \ge 0$, $xy^iz \in L$.

Assume that you are given some language $L$ and you want to show that it is not regular via the pumping lemma. The proof looks like this:

  1. Assume that $L$ is regular.
  2. If it is regular, then the pumping lemma says that there exists some number $n$ which is the pumping length.
  3. Pick a specific word $w\in L$ of length larger than $n$. The difficult part is to know which word to take.
  4. Consider ALL the ways to partition $w$ into 3 parts, $w=xyz$, with $|xy|\le n$ and $y$ non empty. For each of these ways, show that it cannot be pumped: there always exists some $i\ge 0$ such that $xy^iz \notin L$.
  5. Conclude: the word $w$ cannot be "pumped" (no matter how we split it to $xyz$) in contradiction to the pumping lemma, i.e., our assumption (step 1) is wrong: $L$ is not regular.

Before we go to an example, let me reiterate Steps 3 and 4 (this is where most people go wrong). In Step 3 you need to pick one specific word in $L$; write it down explicitly, like "00001111" or "$a^nb^n$". Examples of things that are not a specific word: "$w$" or "a word that has 000 as a prefix".

On the other hand, in Step 4 you need to consider more than one case. For instance, if $w=000111$ it is not enough to say $x=00, y=01, z=11$, and then reach a contradiction. You must also check $x=0, y=0, z=0111$, and $x=\epsilon, y=000, z=111$, and all the other possible options.


Now let's follow the steps and prove that $L= \{ 0^k1^{2k} \mid k>0 \}$ is not regular.

  1. Assume $L$ is regular.
  2. Let $n$ be the pumping length given by the pumping lemma.
  3. Let $w = 0^n 1^{2n}$.
    (Sanity check: $|w|\gt n$ as needed. Why this word? Other words can work as well; it takes some experience to come up with the right $w$.) Again, note that $w$ is a specific word: $\underbrace{000\ldots0}_{n \text{ times}}\underbrace{111\ldots1}_{2n \text{ times}}$.
  4. Now let's consider the various cases for splitting $w$ into $xyz$ with $|xy|\le n$ and $|y|>0$. Since $|xy|\le n$, no matter how we split $w$, both $x$ and $y$ consist only of 0's. Let's write $|x|=s$ and $|y|=k$. We need to consider ALL the options, that is, all possible $s,k$ such that $s\ge 0$, $k\ge 1$ and $s+k \le n$. For this $L$ the proof is the same for all these cases, but in general it might be different.
    Take $i=0$ and consider $xy^iz = xz$. This word is of the form $0^{n-k}1^{2n}$ (no matter what $s$ was), and since $k \ge 1$ it is not in $L$, so we reach a contradiction. (A brute-force check of this case analysis is sketched right after this proof.)
  5. Thus, our assumption is incorrect, and $L$ is not regular.
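Not a substitute for the proof, but for a small concrete $n$ the case analysis in step 4 can be checked mechanically. A minimal Python sketch (the value $n=5$ and the helper names are mine):

```python
def in_L(w):
    # L = { 0^k 1^(2k) : k > 0 }
    k = len(w) - len(w.lstrip("0"))
    return k > 0 and w == "0" * k + "1" * (2 * k)

n = 5                                   # pretend this is the pumping length
w = "0" * n + "1" * (2 * n)             # the chosen word 0^n 1^(2n)

# enumerate every split w = x y z with |xy| <= n and |y| >= 1
for xy_len in range(1, n + 1):
    for x_len in range(xy_len):
        x, y, z = w[:x_len], w[x_len:xy_len], w[xy_len:]
        assert not in_L(x + z), (x, y, z)   # pumping down (i = 0) leaves L
print("no valid split of", w, "survives pumping with i = 0")
```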

A YouTube clip that explains how to use the pumping lemma along the same lines can be found here.

From Wikipedia, the pumping lemma for regular languages is the following:

Let $L$ be a regular language. Then there exists an integer $p\ge 1$ (depending only on $L$) such that every string $w$ in $L$ of length at least $p$ ($p$ is called the "pumping length") can be written as $w = xyz$ (i.e., $w$ can be divided into three substrings), satisfying the following conditions:

  1. $|y| \ge 1$
  2. $|xy| \le p$ and
  3. for all $i \ge 0$, $xy^iz \in L$.
    $y$ is the substring that can be pumped (removed or repeated any number of times, and the resulting string is always in $L$).

(1) means the loop $y$ to be pumped must have length at least one; (2) means the loop must occur within the first $p$ characters. There is no restriction on $x$ and $z$.

In simple words: for any regular language $L$, any sufficiently long word $w\in L$ can be split into three parts, i.e. $w = xyz$, such that all the strings $xy^kz$ for $k\ge 0$ are also in $L$.

Now let's consider an example. Let $L=\{(01)^n2^n\mid n\ge0\}$.

To show that this is not regular, you need to consider what all the decompositions $w=xyz$ look like: what are all the possible things $x$, $y$ and $z$ can be, given that $xyz=(01)^p2^p$ (we choose to look at this particular word, of length $3p$, where $p$ is the pumping length). We need to consider where the $y$ part of the string occurs. It could lie within the first part, and will thus equal $(01)^{k+1}$, $(10)^{k+1}$, $1(01)^k$ or $0(10)^k$, for some $k\ge 0$ (don't forget that $|y|\ge 1$). It could lie within the second part, meaning that $y=2^k$, for some $k>0$. Or it could straddle the two parts of the word, and would then have the form $(01)^{k+1} 2^l$, $(10)^{k+1} 2^l$, $1(01)^k 2^l$ or $0(10)^k 2^l$, for $k\ge0$ and $l\ge1$.

Now pump each one to obtain a contradiction, that is, a word not in your language. For example, if we take $y=0(10)^k2^l$, the pumping lemma says that $xy^2z=x0(10)^k2^l0(10)^k2^lz$ must be in the language, for the corresponding choice of $x$ and $z$. But this word cannot be in the language, as a $2$ appears before a $0$.

Other cases will result in the number of $(01)$'s being more than the number of $2$'s or vice versa, or will result in words that won't have the structure $(01)^n2^n$ by, for example, having two $0$'s in a row.

Don't forget that $|xy| \le p$. Here, it's useful to shorten the proof: many of the decompositions above are impossible because they would make the $z$ part too long.

Each of the cases above needs to lead to such a contradiction, which would then be a contradiction of the pumping lemma. Voila! The language would not be regular.

For a given language $L \subseteq \Sigma^*$, let

$\qquad \displaystyle S_L(z) = \sum\limits_{n \geq 0} |L \cap \Sigma^n|\cdot z^n$

be the (ordinary) generating function of $L$, i.e. the generating function of its sequence of word counts per length.

The following statement holds [FlSe09, p52]:

$\qquad \displaystyle L \in \mathrm{REG} \quad \Longrightarrow \quad S_L \text{ rational}$

That is, $S_L(z) = \frac{P(z)}{Q(z)}$ with $P,Q$ polynomials.

So any language whose generating function is not rational is not regular. Unfortunately, all linear languages also have rational generating functions¹ so this method won't work for the simpler non-regular languages. Another drawback is that obtaining $S_L$ (and showing that it is not rational) can be hard.

Example: Consider the language of correctly nested parentheses words, i.e. the Dyck language. It is generated by the unambiguous grammar

$\qquad \displaystyle S \to [S]S \mid \varepsilon$

which can be translated into the equation

$\qquad \displaystyle S(z) = z^2S^2(z) + 1$

one solution (the one with all positive coefficients) of which is

$\qquad \displaystyle \mathcal{S}(z) = \frac{1 - \sqrt{1 - 4z^2}}{2z^2}$.

As $S_L = \mathcal{S}$ [Kuic70] and $\mathcal{S}$ is not rational, the Dyck language is not regular.
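As an aside (not part of the argument), the series expansion of $\mathcal{S}$ can be checked numerically: its coefficients are the Catalan numbers, which count the Dyck words of each even length. A small SymPy sketch:

```python
import sympy as sp

z = sp.symbols("z")
S = (1 - sp.sqrt(1 - 4 * z**2)) / (2 * z**2)

# coefficients of z^0, z^2, z^4, ... are 1, 1, 2, 5, 14, ... (Catalan numbers),
# i.e. the number of Dyck words of length 0, 2, 4, ...
print(sp.series(S, z, 0, 12))
```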


  1. The proof for the statement for regular languages works via grammars and transfers to linear grammars immediately (commutativity of multiplication).

$\ \ $ [FlSe09] Analytic Combinatorics by P. Flajolet and R. Sedgewick (2009)
$\ \ $ [Kuic70] On the Entropy of Context-Free Languages by W. Kuich (1970)

This is an expanded version of my answer to the question "Using the pumping lemma to prove that the language $L = \{(01)^m 2^m \mid m \ge0\}$ is not regular", since this is supposed to be a reference question.

So, you think the pumping lemma looks complicated? Don't worry. Here's a slightly different approach, which is hidden in @Romuald's answer as well. (Quiz: where?)

Let's start by remembering that every regular language is accepted by a deterministic finite state automaton (DFA). A DFA is a finite directed graph where every vertex has exactly one out-edge for each letter in the alphabet. Strings give you a walk in the graph based at a vertex labeled "start", and the DFA accepts if this walk ends at a vertex labeled "accept". (The vertices are called "states" because different areas of math like to make up their own terminology for the same thing.)

With this way of thinking it is easy to see that: if strings $a$ and $b$ drive the DFA to the same state, then for any other string $c$, $ac$ and $bc$ drive the DFA to the same state. Why? Because the starting point of a walk and the string defining it determine the end completely.

Put slightly differently: If $L$ is regular and strings $a$ and $b$ drive a recognizing automaton to the same state, then for all strings $c$, either $ac$ and $bc$ are both in $L$ or neither is.

We can use this to show a language isn't regular by imagining it is and then coming up with $a$ and $b$ driving a DFA to the same state, and $c$ so that $ac$ is in the language and $bc$ isn't. Take the example language from @Dave's answer. Imagine it is regular, so it has some recognizing DFA with $m$ states. The Pigeon Hole Principle says that at least two of $\{(01)^i : 0\le i\le m+1\}$ send the DFA to the same state, say $a=(01)^p$ and $b=(01)^q$. Since $p\neq q$, we see that $a2^p$ is in the language and $b2^p$ is not, so this language can't be regular.

The nice thing is that the example is really a template for proving that languages aren't regular:

  • Find a family of strings $\{a_i :i\in\mathbb{N}\}$ with the property that each of them has a "tail" $t_i$ so that $a_it_i$ is in the language and $a_it_j$, for $i\neq j$ is not.
  • Apply the argument above verbatim. (This is allowed, since there are always enough $a_i$ to let you invoke the Pigeon Hole Principle.)

There are other tricks, but this one will work easily on most of your homework problems.
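For instance (my instantiation, not part of the original answer), for the language $L=\{0^k1^{2k} \mid k > 0\}$ from the earlier pumping-lemma answer one can take

$\qquad \displaystyle a_i = 0^i, \qquad t_i = 1^{2i} \qquad (i \ge 1),$

so that $a_it_i = 0^i1^{2i}$ is in $L$ while $a_it_j = 0^i1^{2j}$ is not for $i \neq j$; the pigeonhole argument above then goes through verbatim.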

Edit: An earlier version had some discussion of how this idea relates to the Pumping Lemma.

Following the answer here, I will describe a method of proving non-regularity based on Kolmogorov complexity.

This approach is discussed in "A New Approach to Formal Language Theory by Kolmogorov Complexity", by Ming Li and Paul M.B. Vitanyi (see section 3.1).

Let $K(x)$ denote the Kolmogorov complexity of a string $x$, i.e. the length of the shortest encoding of a Turing machine $M$, such that $M(\epsilon)=x$ (any of the usual definitions will do). One can then use the following lemma to prove non regularity:

KC-Regularity: Let $L\subseteq \Sigma^*$ be a regular language. Then there exists a constant $c$, depending only on $L$, such that for all $x\in\Sigma^*$: if $y$ is the $n$-th string (relative to the lexicographic ordering) in $L_x=\left\{y\in \Sigma^* \mid xy\in L\right\}$, then $K(y)\le O(\log n)+c$.

One can understand (and prove) the above lemma as follows: for any $x\in\Sigma^*$, to describe the $n$-th string in $L_x$ one needs to specify:

  • The automaton which accepts $L$
  • The state in the automaton after processing the prefix $x$
  • The index $n$

Since we only need to remember the state after processing $x$, and not $x$ itself, we can hide this factor in the constant depending on $L$. The index $n$ requires $\log n$ bits to describe, and we get the above result (for completeness, one needs to add the specific instructions required to generate $y$, but this only adds a constant number of bits to the final description).

This lemma shows how to bound the Kolmogorov complexity of all strings which are members of $L_x$ for some regular language $L$ and $x\in\Sigma^*$. In order to show non-regularity, one can assume $L$ is regular, and prove that the bounds are too restrictive (e.g. bounded Kolmogorov complexity for an infinite set of strings).

The answer linked above contains an example of how to use this lemma to show $L=\left\{1^p | \text{p is prime}\right\}$ is not regular, several more examples are given in the paper. For completeness, we show here how to prove $L=\left\{0^n1^n| n\ge 0\right\}$ is not regular.

Given some $x\in\left\{0,1\right\}^*$, we denote by $y_i^x$ the $i$-th word in $L_x$. Note that $y_1^{0^i}=1^i$. Using the above lemma, focusing on prefixes $x$ of the form $x=0^i$ and fixing $n=1$, we obtain $\forall i\ge 0 : K(y_1^{0^i})\le c$. Since $y_1^{0^i}=1^i$, this means that we can bound the Kolmogorov complexity of all strings of the form $1^i$ by a constant, which is obviously false. It is worth mentioning that we could have examined a single $x$, e.g. $x=0^n$ for large enough $n$ which satisfies $K(0^n)\ge \log n$ (we start with a high-complexity prefix). Since $y_1^x=1^n$, we get $K(1^n)<c$, a contradiction (suppose $n>2^c$).

In the case of unary languages (languages over an alphabet of size 1), there is a simple criterion. Let us fix an alphabet $\{ \sigma \}$, and for $A \subseteq \mathbb{N}$, define $$ L(A) = \{ \sigma^n : n \in A \}. $$

Theorem. Let $A \subseteq \mathbb{N}$. The following are equivalent:

  1. $L(A)$ is regular.

  2. $L(A)$ is context-free.

  3. There exist $n_0,m \geq 1$ such that for all $n \geq n_0$, it holds that $n \in A$ iff $n+m \in A$. (We say that $A$ is eventually periodic.)

  4. Let $a_i = 1_{i \in A}$. Then $0.a_0a_1a_2\ldots$ is rational.

  5. The generating function $\sum_{i \in A} x^i$ is a rational function.

The theorem can be proved in many ways, for example using the pumping lemma, Myhill–Nerode theory, Parikh's theorem, the structure of DFAs on unary languages (they look like a "$\rho$", as in Pollard's $\rho$ algorithm), and so on. Here is a useful corollary.

Corollary. Let $A \subseteq \mathbb{N}$, and suppose that $L(A)$ is regular.

  1. The limit $\rho = \lim_{n\to\infty} \frac{|A \cap \{1,\ldots,n\}|}{n}$ exists. (This is the asymptotic density of $A$.)

  2. If $\rho = 0$ then $A$ is finite.

  3. If $\rho = 1$ then $A$ is cofinite (that is, $\overline{A}$ is finite).

As an example, the language $L(\{2^n : n \geq 0\})$ isn't regular, since the set has vanishing asymptotic density, yet is infinite.
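As a rough numerical illustration (not a proof), one can watch this density go to zero. A small Python sketch:

```python
A = {2**k for k in range(60)}           # powers of two, enough for the range below

for n in (10, 1_000, 100_000, 1_000_000):
    count = sum(1 for i in range(1, n + 1) if i in A)
    print(f"density of A up to {n}: {count / n:.7f}")
# the ratio tends to 0 while A is infinite, so by the corollary L(A) cannot be regular
```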

The class of regular languages is closed under various operations, such as union, intersection, complement, homomorphism, regular substitution, inverse homomorphism, and more. This can be used to prove that a given language is not regular by reduction to a language which is already known to be non-regular.

As a very simple example, suppose that we know that the language $\{a^nb^n : n \geq 0\}$ is not regular. Then we can prove that the language $\{w \in \{a,b\}^* : \#_a(w) = \#_b(w)\}$ (the language of all words with equally many $a$s and $b$s) is not regular as follows:

Suppose that $L = \{w \in \{a,b\}^* : \#_a(w) = \#_b(w)\}$ were regular. Then $L \cap a^*b^*$ would also be regular. But $L \cap a^*b^* = \{a^n b^n : n \geq 0\}$, which is known not to be regular.

Here is a more complicated example. Let us show that the language $L' = \{(0+1)^n2(0+1)^n : n \geq 0\}$ is not regular.

Let $h$ be the homomorphism given by $h(0) = 0$, $h(1) = 1$, $h(2) = \epsilon$. If $L'$ were regular then so would the following language be: $h(L' \cap 0^*21^*) = \{ 0^n 1^n : n \geq 0 \}$. However, we know that the latter isn't regular.

Finally, here is an example using inverse homomorphism. Let us show that the language $L'' = \{0^n10^n : n \geq 0\}$ isn't regular.

Let $k$ be the homomorphism given by $k(0) = 0$, $k(1) = 0$, $k(2) = 1$. If $L''$ were regular then so would $k^{-1}(L'')$ be, but that is just the language $L'$ from the preceding example.
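Both reductions can be sanity-checked by brute force on short strings (again, not a proof). A minimal Python sketch with ad-hoc membership tests; the homomorphisms $h$ and $k$ are represented as dictionaries and the helper names are mine:

```python
from itertools import product

h = {"0": "0", "1": "1", "2": ""}       # h erases 2's
k = {"0": "0", "1": "0", "2": "1"}      # k maps 0,1 -> 0 and 2 -> 1

def apply_hom(hom, w):
    return "".join(hom[c] for c in w)

def in_Lprime(w):      # L' = { u 2 v : u, v in {0,1}^n for some n }
    if w.count("2") != 1:
        return False
    u, v = w.split("2")
    return len(u) == len(v)

def in_Ldprime(w):     # L'' = { 0^n 1 0^n : n >= 0 }
    if w.count("1") != 1 or "2" in w:
        return False
    u, v = w.split("1")
    return len(u) == len(v)

def in_0n1n(w):
    n = len(w) // 2
    return w == "0" * n + "1" * n

# h(L' intersected with 0*21*) = { 0^n 1^n }: those words are exactly 0^n 2 1^n
for n in range(6):
    w = "0" * n + "2" + "1" * n
    assert in_Lprime(w) and in_0n1n(apply_hom(h, w))

# k^{-1}(L'') = L': membership in L'' after applying k coincides with membership in L'
for length in range(8):
    for w in ("".join(t) for t in product("012", repeat=length)):
        assert in_Ldprime(apply_hom(k, w)) == in_Lprime(w)

print("both reductions check out on all strings up to length 7")
```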

Use Myhill–Nerode theory.

Let $L$ be a language. We say that two words $x,y$ are inequivalent modulo $L$ (or: with respect to $L$) if there exists a word $z$ such that exactly one of $xz,yz$ is in $L$. In that case, in any DFA for $L$, $\delta(q_0,x) \neq \delta(q_0,y)$ (exercise). This implies the following criterion:

Let $L$ be a language. Suppose that there exists an infinite set of pairwise inequivalent words (that is, an infinite set $S$ such that any two non-equal $x,y \in S$ are inequivalent modulo $L$). Then $L$ is not regular.

Here is a simple example of applying this criterion:

The language $L = \{a^nb^n : n \geq 0\}$ isn't regular.

Proof. Let $S = \{ a^n : n \geq 0 \}$. We claim that any two different words in $S$ are inequivalent modulo $L$. Indeed, let $a^i,a^j \in S$, where $i \ne j$. Then $a^ib^i \in L$ but $a^jb^i \notin L$.

An important feature of this method is that it is guaranteed to succeed: if $L$ is not regular then there exists an infinite set of pairwise inequivalent words. This is a consequence of the Myhill–Nerode theorem. Briefly, equivalence modulo $L$ (the negation of inequivalence modulo $L$ defined above) is an equivalence relation, and a language $L$ is regular iff the number of equivalence classes of equivalence modulo $L$ is finite. If $L$ is not regular, taking one word out of each equivalence class yields an infinite set of pairwise inequivalent words.
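A small brute-force sketch of this criterion (illustrative only; the helper names are mine): for each pair of prefixes $a^i, a^j$ it searches for a suffix $z$ witnessing their inequivalence modulo $L = \{a^nb^n : n \ge 0\}$.

```python
from itertools import product

def in_L(w):
    # L = { a^n b^n : n >= 0 }
    n = len(w) // 2
    return w == "a" * n + "b" * n

def distinguishing_suffix(x, y, max_len=8):
    """Return some z such that exactly one of xz, yz is in L, or None."""
    for length in range(max_len + 1):
        for z in ("".join(t) for t in product("ab", repeat=length)):
            if in_L(x + z) != in_L(y + z):
                return z
    return None

# any two distinct words of S = { a^i } are inequivalent modulo L
for i in range(5):
    for j in range(i + 1, 6):
        z = distinguishing_suffix("a" * i, "a" * j)
        assert z is not None            # the witness found here is b^i, as in the proof
        print(f"a^{i} and a^{j} are distinguished by z = {z!r}")
```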

Given a language $L$, for every string $x$ there is a set of strings $y$ such that $xy \in L$. Each such set could be used as a state in a state machine.

All you need to do is to show that the number of such sets is not finite.

As an example, let $L = \{a^nb^n : n \geq 0\}$. Given $x = a^nb$ for some $n \geq 1$, the only string $y$ such that $xy \in L$ is $y = b^{n-1}$. So for every $n$ we have a different set, which means $L$ is not regular.

So in general, if you find an infinite set of strings $x$ such that each $x$ gives a different set $\{y: xy \in L\}$ then the language cannot be recognised by a finite state machine, and therefore is not regular.

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange