Question

I am currently looking into the problem of regular expressions which can end up running in exponential time when matched against a certain input, for example both (a*)* and (a|a)* potentially exhibit 'catastrophic backtracking' when matched against the string aaaaab - for every extra 'a' in the matched string, the time needed to attempt to match the string doubles. This is only the case if the engine uses a backtracking/NFA approach of attempting to try all possible branches in the tree before failing, such as that used in PCRE.

My question is, why isn't (a?)* vulnerable? Based on my understanding of backtracking, what should happen in the string "aaaab" is essentially what happens with (a|a)*. If we construct the NFA using the standard Thomspson NFA construction, surely for each epsilon transition that occurs, the engine will have to keep taking them and backtracking in the same way it would for the case of two a's? For example (omitting some steps and where @ replaces epsilon):

"aaaa" matches, but can't match 'b', fail (backtrack)
"aaaa@" matches, 'b' fail (backtrack)
"aaa@a" matches, 'b' fail (backtrack)
"aaa@a@" matches, 'b' fail (backtrack)
...
"@a@a@a@a@" matches, 'b' fails (backtrack)

trying all possible combinations of epsilons and a, surely leading to an exponential blowup of routes?

It would make sense to remove the epsilon transitions from the NFA, but I believe this has the effect of removing all non-determinism from the (a*)* pattern. This is definitely vulnerable though, so I'm not entirely sure what's going on!

Thank you very much in advance!

Edit: It has been pointed out by Qtax that epsilons can't still be present when the NFA is traversed with the traditional backtracking, otherwise (@)* will attempt to match forever. So what NFA implementation could possibly lead to (a*)* and (a|a)* being exponential, and (a?)* not being so? This is the crux of the question really.

Was it helpful?

Solution

Ok, after some sleuthing, I've eventually managed to find out that this is down to the use of 'barriers' in NFA implementations. Simply put, barriers are placed at strategic points in the NFA (such as on the node immediately after the 'a' transition in the NFA construction of a*). They require that the match has progressed on from the previous time that barrier was hit. This prevents the NFA from ever getting into a situation where it matches an infinite number of epsilons and allows it to terminate.

In other words, it is not possible to go from one barrier to the same barrier only matching e-moves - if this happened, the route is dropped and backtracking occurs from the previous point. This also has the side effect that (a?)* is not vulnerable to exponential blow-up, since the a? cannot match null the second time around.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top