What would be an actual DFA/actual regex engine solve this simple regex pattern?
https://softwareengineering.stackexchange.com/questions/302133
-
08-12-2020 - |
Question
.*foo
Academically, that is a regular expression. A string aaaaaaaaaaaaafoo
will match that. Also string aaaaaaaaaaaaafooaaaaafoo
will match that too.
I would imagine that an NFA that solve that will have 4 states. State 0 will have an arrow going back to itself. Then an arrow f going to state 1, o to state 2, and another o to state 3.
The NFA would keep looping through to state 0 before it magically uses it's psychic power to decide aha, now I am going to start looking for foo. However, I have a hard time understanding psychic computer.
What would be a DFA for that? I would imagine that the number of state would probably be 6. So a the DFA may contain a boolean variable whether you are still looping or still going. I still can't quite figure that out. So can anyone help me construct a DFA for that?
Also .*foo
, and it's variant .*?foo
and .*+foo
are actual regular expression used nowadays in non academic sense.
How would an actual regex engine implement those 3 expressions? How close are those actual implementation to actual DFA? I mean I know real DFA don't look back. I also know that real regex engine do look back and forth.
For simplicity sake let's presume that the DFA can contains variable instead of just states. I think the states of DFA are just variable in memories anyway.
Solution
With a simple conversion you can change a NFA to a DFA by taking each state of the DFA as a subset of the NFA states. The transition function would take each State of the input and get the set of next states and make the union of all of them. This models the computer being in multiple states at the same time.
As a result Your 4 state NFA would result in a 4 state DFA.
A starting state
{s0}
. It has a transition withf
to the next state{s0, s1}
and any other loops back to itself.a state
{s0, s1}
. It has a transition witho
to the next state{s0,s2}
, a transition withf
to itself and any other will go back to{s0}
.a state
{s0,s2}
It has a transition witho
to the next state{s0, s3}
a transition withf
to{s0, s1}
and any other will go back to{s0}
.and a finishing state
{s0,s3}
with a transition withf
to{s0, s1}
and all others back to{s0}
More state combinations (a total of 2^4) are possible but none of the others are reachable from the starting state.
This is the method that a few regex engines use to avoid backtracking in the NFA constructed from the regex string though it won't be able to model capturing groups and backrefs. It is essentially a breath-first search into the NFA.
Other regex engines (most actually) basically do a depth-first search in the constructed NFA. Each time multiple options are available it will pick one and try that, if it fails then it backtrack to that decision and tries the other path. Which path it picks first is decided by the greediness modifier.