How to convert finite automata to regular expressions?

https://cs.stackexchange.com/questions/2016

16-10-2019
|

Pergunta

Converting regular expressions into (minimal) NFA that accept the same language is easy with standard algorithms, e.g. Thompson's algorithm. The other direction seems to be more tedious, though, and sometimes the resulting expressions are messy.

What algorithms are there for converting NFA into equivalent regular expressions? Are there advantages regarding time complexity or result size?

^{This is supposed to be a reference question. Please include a general decription of your method as well as a non-trivial example.}

Solução

There are several methods to do the conversion from finite automata to regular expressions. Here I will describe the one usually taught in school which is very visual. I believe it is the most used in practice. However, writing the algorithm is not such a good idea.

State removal method

This algorithm is about handling the graph of the automaton and is thus not very suitable for algorithms since it needs graph primitives such as ... state removal. I will describe it using higher-level primitives.

The key idea

The idea is to consider regular expressions on edges and then removing intermediate states while keeping the edges labels consistent.

The main pattern can be seen in the following to figures. The first has labels between $p,q,r$ that are regular expressions $e,f,g,h,i$ and we want to remove $q$.

p-q-r automaton

Once removed, we compose $e,f,g,h,i$ together (while preserving the other edges between $p$ and $r$ but this is not displayed on this):

enter image description here

Example

Using the same example as in Raphael's answer:

1-2-3 automaton

we successively remove $q_2$:

1-3 automaton

and then $q_3$:

1 automaton

then we still have to apply a star on the expression from $q_1$ to $q_1$. In this case, the final state is also initial so we really just need to add a star:

$$ (ab+(b+aa)(ba)^*(a+bb))^* $$

Algorithm

L[i,j] is the regexp of the language from $q_i$ to $q_j$. First, we remove all multi-edges:

for i = 1 to n:
  for j = 1 to n:
    if i == j then:
      L[i,j] := ε
    else:
      L[i,j] := ∅
    for a in Σ:
      if trans(i, a, j):
        L[i,j] := L[i,j] + a

Now, the state removal. Suppose we want to remove the state $q_k$:

remove(k):
  for i = 1 to n:
    for j = 1 to n:
      L[i,i] += L[i,k] . star(L[k,k]) . L[k,i]
      L[j,j] += L[j,k] . star(L[k,k]) . L[k,j]
      L[i,j] += L[i,k] . star(L[k,k]) . L[k,j]
      L[j,i] += L[j,k] . star(L[k,k]) . L[k,i]

Note that both with a pencil of paper and with an algorithm you should simplify expressions like star(ε)=ε, e.ε=e, ∅+e=e, ∅.e=∅ (By hand you just don't write the edge when it's not $∅$, or even $ε$ for a self-loop and you ignore when there is no transition between $q_i$ and $q_k$ or $q_j$ and $q_k$)

Now, how to use remove(k)? You should not remove final or initial states lightly, otherwise you will miss parts of the language.

for i = 1 to n:
  if not(final(i)) and not(initial(i)):
    remove(i)

If you have only one final state $q_f$ and one initial state $q_s$ then the final expression is:

e := star(L[s,s]) . L[s,f] . star(L[f,s] . star(L[s,s]) . L[s,f] + L[f,f])

If you have several final states (or even initial states) then there is no simple way of merging these ones, other than applying the transitive closure method. Usually this is not a problem by hand but this is awkward when writing the algorithm. A much simpler workaround is to enumerate all pairs $(s,f)$ and run the algorithm on the (already state-removed) graph to get all expressions $e_{s,f}$ supposing $s$ is the only initial state and $f$ is the only final state, then doing the union of all $e_{s,f}$.

This, and the fact that this is modifying languages more dynamically than the first method make it more error-prone when programming. I suggest using any other method.

Cons

There are a lot of cases in this algorithm, for example for choosing which node we should remove, the number of final states at the end, the fact that a final state can be initial, too etc.

Note that now that the algorithm is written, this is a lot like the transitive closure method. Only the context of the usage is different. I do not recommend implementing the algorithm, but using the method to do that by hand is a good idea.

Outras dicas

Method

The nicest method I have seen is one that expresses the automaton as equation system of (regular) languages which can be solved. It is in particular nice as it seems to yield more concise expressions than other methods.

Let $A= (Q,\Sigma,\delta,q_0,F)$ an NFA without $\varepsilon$-transitions. For every state $q_i$, create the equation

$\qquad \displaystyle Q_i = \bigcup\limits_{q_i \overset{a}{\to} q_j} aQ_j \cup \begin{cases} \{\varepsilon\} &,\ q_i \in F \\ \emptyset &, \text{ else}\end{cases}$

where $F$ is the set of final states and $q_i \overset{a}{\to} q_j$ means there is a transition from $q_i$ to $q_j$ labelled with $a$. If you read $\cup$ as $+$ or $\mid$ (depending on your regular expression definition), you see that this is an equation of regular expressions.

For solving the system you need associativity and distributivity of $\cup$ and $\cdot$ (string concatenation), commutativity of $\cup$ and Arden's Lemma¹:

Let $L,U,V \subseteq \Sigma^*$ regular languages with $\varepsilon \notin U$. Then,

$\qquad \displaystyle L = UL \cup V \quad \Longleftrightarrow \quad L = U^*V$

The solution is a set of regular expressions $Q_i$, one for every state $q_i$. $Q_i$ describes exactly those words that can be accepted by $A$ when started in $q_i$; therefore $Q_0$ (if $q_0$ is the initial state) is the desired expression.

Example

^{For the sake of clarity, we denotate singleton sets by their element, i.e. $a = \{a\}$. The example is due to Georg Zetzsche.}

Consider this NFA:

example nfa
^[source]

The corresponding equation system is:

$\qquad \begin{align} Q_0 &= aQ_1 \cup bQ_2 \cup \varepsilon \\ Q_1 &= bQ_0 \cup aQ_2 \\ Q_2 &= aQ_0 \cup bQ_1 \end{align}$

Now plug the third equation into the second:

$\qquad \begin{align} Q_1 &= bQ_0 \cup a(aQ_0 \cup bQ_1) \\ &= abQ_1 \cup (b \cup aa)Q_0 \\ &= (ab)^*(b \cup aa)Q_0 \end{align}$

For the last step, we apply Arden's Lemma with $L = Q_1$, $U = ab$ and $V = (b \cup aa) \cdot Q_0$. Note that all three languages are regular and $\varepsilon \notin U = \{ab\}$, enabling us to apply the lemma. Now we plug this result into the first equation:

$\qquad \begin{align} Q_0 &= a(ab)^*(b \cup aa)Q_0 \cup baQ_0 \cup bb(ab)^*(b \cup aa)Q_0 \cup \varepsilon \\ &= ((a \cup bb)(ab)^*(b \cup aa) \cup ba)Q_0 \cup \varepsilon \\ &= ((a \cup bb)(ab)^*(b \cup aa) \cup ba)^* \qquad \text{(by Arden's Lemma)} \end{align}$

Thus, we have found a regular expression for the language accepted by above automaton, namely

$\qquad \displaystyle ((a + bb)(ab)^*(b + aa) + ba)^*.$

Note that it is quite succinct (compare with the result of other methods) but not uniquely determined; solving the equation system with a different sequence of manipulations leads to other -- equivalent! -- expressions.

For a proof of Arden's Lemma, see here.

Brzozowski algebraic method

This is the same method as the one described in Raphael's answer, but from a point of view of a systematic algorithm, and then, indeed, the algorithm. It turns out to be easy and natural to implement once you know where to begin. Also it may be easier by hand if drawing all the automata is impractical for some reason.

When writing an algorithm you have to remember that the equations must be always linear so that you have a good abstract representation of the equations, thing that you can forget when you are solving by hand.

The idea of the algorithm

I won't describe how it works since it is well done in Raphael's answer which I suggest to read before. Instead, I focus on in which order you should solve the equations without doing too many extra computations or extra cases.

Starting from Arden's rule's ingenious solution $X=A^*B$ to the language equation $X=AX∪B$ we can consider the automaton as a set of equations of the form:

$$X_i = B_i + A_{i,1}X_1 + … + A_{i,n}X_n$$

we can solve this by induction on $n$ by updating the arrays $A_{i,j}$ and $B_{i,j}$ accordingly. At the step $n$, we have:

$$X_n = B_n + A_{n,1}X_1 + … + A_{n,n}X_n$$

and Arden's rule gives us:

$$X_n = A_{n,n}^* (B_n + A_{n,1}X_1 + … + A_{n,n-1}X_{n-1})$$

and by setting $B'_n = A_{n,n}^* B_n$ and $A'_{n,i}=A_{n,n}^*A_{n,i}$ we get:

$$X_n = B'_n + A'_{n,1}X_1 + … + A'_{n,n-1}X_{n-1}$$

and we can then remove all needs of $X_n$ in the system by setting, for $i,j<n$:

$$B'_i = B_i + A_{i,n}B'_n$$ $$A'_{i,j} = A_{i,j} + A_{i,n}A'_{n,j}$$

When we have solved $X_n$ when $n=1$, we obtain a equation like this:

$$X_1 = B'_1$$

with no $A'_{1,i}$. Thus we got our regular expression.

The algorithm

Thanks to this, we can build the algorithm. To have the same convention than in the induction above, we will say that the initial state is $q_1$ and that the number of state is $m$. First, the initialization to fill $B$:

for i = 1 to m:
  if final(i):
    B[i] := ε
  else:
    B[i] := ∅

and $A$:

for i = 1 to m:
  for j = 1 to m:
    for a in Σ:
      if trans(i, a, j):
        A[i,j] := a
      else:
        A[i,j] := ∅

and then the solving:

for n = m decreasing to 1:
  B[n] := star(A[n,n]) . B[n]
  for j = 1 to n:
    A[n,j] := star(A[n,n]) . A[n,j];
  for i = 1 to n:
    B[i] += A[i,n] . B[n]
    for j = 1 to n:
      A[i,j] += A[i,n] . A[n,j]

the final expression is then:

e := B[1]

Implementation

Even if it may seems a system of equations that seems too symbolic for an algorithm, this one is well-suited for an implementation. ~~Here is an implementation of this algorithm in Ocaml~~ (broken link). Note that apart from the function brzozowski, everything is to print or to use for Raphael's example. Note that there is a surprisingly efficient function of simplification of regular expressions simple_re.

Transitive closure method

This method is easy to write in a form of an algorithm, but generates absurdly large regular expressions and is impractical if you do it by hand, mostly because this is too systematic. It is a good and simple solution for an algorithm though.

The key idea

Let $R^k_{i,j}$ represent the regular expression for the strings going from $q_i$ to $q_j$ using the states $\{q_1, …, q_k\}$. Let $n$ be the number of states of the automaton.

Suppose that you know already the regular expression $R_{i,j}$ from $q_i$ to $q_j$ without the intermediate state $q_k$ (except for extremities), for all $i,j$. Then you can guess how adding another state will affect the new regular expression $R'_{i,j}$: it changes only if you have direct transitions to $q_k$, and it can be expressed like that:

$$R'_{i,j} = R_{i,j} + R_{i,k} . R_{k,k}^* . R_{k,j}$$

($R$ is $R^{k-1}$ and $R'$ is $R^k$.)

Example

We will use the same example as in Raphael's answer. At first, you can only use the direct transitions.

Here is the first step (note that a self loop with a label $a$ would have transformed the first $ε$ into $(ε+a)$.

$$R^0 = \begin{bmatrix} ε & a & b \\ b & ε & a \\ a & b & ε \end{bmatrix} $$

At the second step we can use $q_0$ (which is renamed into $q_1$ for us, because $R^0$ is already used for the purpose above). We will see how $R^1$ works.

From $q_2$ to $q_2$: $R^1_{2,2} = R^0_{2,2} + R^0_{2,1} {R^0_{1,1}}^* R^0_{1,2} = ε + b ε^* a = ε + ba$.

Why is that? It is because going from $q_2$ to $q_2$ using only $q_1$ as an intermediate state can be done by staying here ($ε$) or going to $q_1$ ($a$), looping there ($ε^*$) and coming back ($b$).

$$R^1 = \begin{bmatrix} ε & a & b \\ b & ε+ba & a+bb \\ a & b+aa & ε+ab \end{bmatrix} $$

You can compute like that $R^2$ and $R^3$, too, and $R^3_{1,1}$ will give you the final expression because $1$ is both initial and final. Note that a lot of simplification of expressions have been done here. Otherwise the first $a$ of $R^0$ would be $(∅+a)$ and the first $a$ of $R^1$ would be $((∅+a)+ε(ε)^*a)$.

Algorithm

Initialization:

for i = 1 to n:
  for j = 1 to n:
    if i == j:
      R[i,j,0] := ε
    else:
      R[i,j,0] := ∅
    for a in Σ:
      if trans(i, a, j):
        R[i,j,0] := R[i,j,0] + a

Transitive closure:

for k = 1 to n:
  for i = 1 to n:
    for j = 1 to n:
      R[i,j,k] := R[i,j,k-1] + R[i,k,k-1] . star(R[k,k,k-1]) . R(k,j,k-1)

Then the final expression is (supposing $q_s$ is the initial state):

e := ∅
for i = 1 to n:
  if final(i):
    e := e + R[s,i,n]

But you can imagine it generates ugly regular expressions. You can indeed expect things like $(∅)^*+(a+(∅)^*)(ε)^*(a + ∅)$ that represents the same language as $aa$. Note that simplifying a regular expression is useful in practice.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a cs.stackexchange