Question

.*foo

Academically, that is a regular expression. A string aaaaaaaaaaaaafoo will match that. Also string aaaaaaaaaaaaafooaaaaafoo will match that too.

I would imagine that an NFA that solve that will have 4 states. State 0 will have an arrow going back to itself. Then an arrow f going to state 1, o to state 2, and another o to state 3.

The NFA would keep looping through to state 0 before it magically uses it's psychic power to decide aha, now I am going to start looking for foo. However, I have a hard time understanding psychic computer.

What would be a DFA for that? I would imagine that the number of state would probably be 6. So a the DFA may contain a boolean variable whether you are still looping or still going. I still can't quite figure that out. So can anyone help me construct a DFA for that?

Also .*foo, and it's variant .*?foo and .*+foo are actual regular expression used nowadays in non academic sense.

How would an actual regex engine implement those 3 expressions? How close are those actual implementation to actual DFA? I mean I know real DFA don't look back. I also know that real regex engine do look back and forth.

For simplicity sake let's presume that the DFA can contains variable instead of just states. I think the states of DFA are just variable in memories anyway.

Was it helpful?

Solution

With a simple conversion you can change a NFA to a DFA by taking each state of the DFA as a subset of the NFA states. The transition function would take each State of the input and get the set of next states and make the union of all of them. This models the computer being in multiple states at the same time.

As a result Your 4 state NFA would result in a 4 state DFA.

  1. A starting state {s0}. It has a transition with f to the next state {s0, s1} and any other loops back to itself.

  2. a state {s0, s1}. It has a transition with o to the next state {s0,s2}, a transition with f to itself and any other will go back to {s0}.

  3. a state {s0,s2} It has a transition with o to the next state {s0, s3} a transition with f to {s0, s1} and any other will go back to {s0}.

  4. and a finishing state {s0,s3} with a transition with f to {s0, s1} and all others back to {s0}

More state combinations (a total of 2^4) are possible but none of the others are reachable from the starting state.

This is the method that a few regex engines use to avoid backtracking in the NFA constructed from the regex string though it won't be able to model capturing groups and backrefs. It is essentially a breath-first search into the NFA.

Other regex engines (most actually) basically do a depth-first search in the constructed NFA. Each time multiple options are available it will pick one and try that, if it fails then it backtrack to that decision and tries the other path. Which path it picks first is decided by the greediness modifier.

Licensed under: CC-BY-SA with attribution
scroll top