What are the possible sets of word lengths in a regular language?

https://cs.stackexchange.com/questions/164

16-10-2019
|

Question

Given a language $L$, define the length set of $L$ as the set of lengths of words in $L$: $$\mathrm{LS}(L) = \{|u| \mid u \in L \}$$

Which sets of integers can be the length set of a regular language?

Solution

First, an observation which is not crucial but convenient: the set $\mathscr{S}$ of sets of integers that are $LS(L)$ for some regular language $L$ on a non-empty alphabet $\mathscr{A}$ does not depend on the choice of alphabet. To see that, consider a finite automaton that recognizes $L$; the lengths of the words that are in $L$ are the lengths of the paths on the automaton seen as an unlabeled graph from the start state to any accept state. In particular, you can relabel every arrow to $a$ and get a regular language with the same length set over the alphabet $\{a\}$. Conversely, if $L$ is a regular language over a one-element alphabet, it can be trivially injected into a larger alphabet, and the result is still a regular language.

Therefore we are looking for the possible length sets for words over a singleton alphabet. On a singleton alphabet, the language is the length set written out in unary: $\mathrm{LS}(L) = \{n\in\mathbb{N} \mid a^n \in L\}$. Such languages are called unary languages.

Let $L$ be a regular language, and consider a deterministic finite automaton (DFA) that recognizes $L$. The set of lengths of words of $L$ is the set of lengths of paths in the DFA seen as a directed graph that start on the start state and end in one of the accept states. A DFA on a one-element alphabet is pretty tame (NFAs would be wilder): it's either a finite list or a circular list. If the list is finite, number the states from $0$ to $h$ following the list order; if it's circular, number the states from $0$ to $h$ following the head of the list, and $h$ to $h+r$ along the loop.

list-shaped automata

Let $F$ be the set of indices of accept states up to $h$, and $G$ be the set of indices of accept states from $h$ to $h+r$. Then

$$\mathrm{LS}(L) = F \cup \{ k \, r + x \mid x \in G, k\in\mathbb{N} \}$$

Conversely, let $h$ and $r$ be two integers and $F$ and $G$ be two finite sets of integers such that $\forall x \in F, x \le h$ and $\forall x \in G, h \le x \le h+r$. Then the set $L_{F,G,r} = \{ a^{k\,r+x} \mid x\in G, k\in\mathbb{N} \}$ is a regular language: it is the language recognized by the DFA described above. A regular expression that describes this language is $a^F \mid a^{G} (a^r)^*$.

To summarize in English, the length sets of regular languages are the sets of integers that are periodic¹ above a certain value.

¹ _{To hang on to a well-established notion, periodic means the characteristic function of the set (which is a function $\mathbb{N}\to\{\mathtt{false},\mathtt{true}\}$ which we lift to a function $\mathbb{Z}\to\{\mathtt{false},\mathtt{true}\}$) is periodic. Periodic above a certain value means that the function restricted to $[h,+\infty[$ can be prolonged into a periodic function.}

OTHER TIPS

Any finite subset $\{\ell_1,\ldots,\ell_n\}\subset\mathbb{N}$ can be the lenght-set of a regular language $L$, since you can take a unary alphabet $\{0\}$ and define $L$ as $\{0^{\ell_1},\ldots,0^{\ell_n}\}$ (this includes the empty language and $\{\varepsilon\}$).

Now for the infinite sets. I'll give a short analysis, though the final answer might not be explicit enough. I won't elaborate unless you ask me to, because I think it's intuitive and because I don't have much time now.

Let $r_1,r_2$ be regular expressions generating languages $L_1$ and $L_2$, respectively. It is (sort of) easy to see that

$\mathsf{LS}(L(r_1+r_2))=\mathsf{LS}(L_1\cup L_2)=\mathsf{LS}(L_1)\cup\mathsf{LS}(L_2)$.
$\mathsf{LS}(L(r_1r_2))=\mathsf{LS}(L_1L_2)=\{\ell_1+\ell_2:\ell_1\in\mathsf{LS}(L_1),\ell_2\in\mathsf{LS}(L_2)\}$. This is denoted $\mathsf{LS}(L_1)+\mathsf{LS}(L_2)$.
$$\mathsf{LS}(L(r_1^*))=\{0\}\cup\bigcup_{n\geq 1}\Big\{\sum_{i=1}^n\ell_i:(\ell_1,\ldots,\ell_n)\in\big(\mathsf{LS}(L_1)\big)^n\Big\}.$$

Thus, the possible sets of integers that can be the length-set of a regular language are the ones that are finite subsets of $\mathbb{N}$ or that can be built by taking finite subsets $S_1,S_2$ of $\mathbb{N}$ and using the previous formulas a finite number of times.

Here, we are using that regular languages are built, by definition, by applying the rules for constructing a regular expression a finite number of times. Note that we can start with any finite subset of $\mathbb{N}$, even though in regular expressions we start with words of length 0 and 1 only as the base case. This is easily justified by the fact that all (finite) words are (finite) concatenations of the symbols of the alphabet.

According to the pumping lemma for regular languages, there exists an $n$ such that a string $x$ of length at least equal to $n$ can be written in the following form: $$x = uvw$$ Where the following three conditions hold: $$|uv| < n$$ $$|v| > 0$$ $$uv^{k}w \in L$$

This gives us one test for sets: a set cannot be the length set of a regular language unless all its elements can be expressed as some arbitrary set of integers no greater than a fixed $n$, plus some multiple of an undetermined value $m$ (the length of $v$), plus some arbitrary finite value.

In other words, it looks like the possible sets of language lengths for regular languages is the closure with respect to set union (as discussed under EDIT and EDIT2, thanks to commenters) of sets described as follows: $$\{a + bn | n \in \mathbb{N}\} \cup S$$ For fixed $a, b \in \mathbb{N}$ and all finite sets $S$, by the pumping lemma for regular languages (thanks to Gilles for pointing out a silly mistake in my original version, whereby I was defining the set $\mathbb{N}$).

EDIT: A little more discussion. Certainly all finite sets of integers are length sets. Also, the union of two length sets must also be a length set, as must be the complement of any length set (hence intersection, hence difference). The reason for this is that the regular languages are closed under these operations. Therefore, the answer I give above is (possibly) incomplete; in reality, any union of such sets is also the length set of some regular language (note that I have abandoned requiring intersection, complement, difference, etc., since these are covered by the fact that regular languages are closed under these properties, as discussed in EDIT3; I think that only union is actually necessary, even if the others are right, which might not be the case).

EDIT2: Even more discussion. The answer I give is basically where you'd end up if you took Janoma's answer a little further; the $bn$ part comes from the Kleene star, the $a$ comes from concatenation, and the discussion of union, intersection, difference and complement come from the + of regular expressions (as well as other closure properties of regular languages) provable starting from automata).

EDIT3: In light of Janoma's comment, let's forget closure properties of language length sets that I discuss in the first EDIT. Since the regular languages have these closure properties, and since every regular language has a DFA, it follows that the pumping lemma for regular languages applies to all unions, intersections, complements, and differences of regular languages, and we'll leave it at that; no need to even consider any of these, except union, which I still think might be necessary to make my original (modified, thanks to input from Gilles) correct. So, my final answer is this: what I say in the original version, plus the closure of language length sets with respect to set union.

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange