Question

I am aware of Regular Expression Denial of Service (ReDoS). Is there any reasonable way to allow users to create custom regexes while guaranteeing that they don't submit an exponentially slow pattern?

Was it helpful?

Solution

The problem with regular expressions is not the regex itself, but that the regex engine that has all kinds of “convenient” features like backtracking. Therefore, using a regex engine without these features avoids.

Regular expressions the computer science concept can always be matched in linear time after they are compiled to a finite state machine. So a state-machine based regex engine can't be used for ReDoS. However, the necessary state machines may become rather large in pathological examples. But limiting the available memory tends to be easier than limiting the available computation time.

The RE2 engine was developed specifically to deal with untrusted regexes and was designed for linear-time execution.

Another alternative is assembling the regexes yourself from a simplified notation. For example, you might allow users to use glob patterns (like *.txt). You can then parse that in a way that prevents backtracking, e.g. by disallowing nesting and only using greedy quantifiers. For many use cases, a simplified pattern notation is completely sufficient.

OTHER TIPS

Analyzing a regular expression to see whether it will be slow or not, without the analysis becoming slow itself, amounts to solving the halting problem. In other words, it's not possible to find a correct and complete solution.

You can, of course, find a solution that is correct and incomplete. For instance, you could work with a restrictive white list of features that are safe to use (e.g. character classes yes, repetition no...). This would allow you to pass a lot of uncritical regexes, reject all critical ones, and (wrongly) reject some that are okay, but too complicated to prove safe automatically.

As author of re parser for lazarus project I would say there are no ways to understand for any given regular expression what resources it will consume on a given text.

Without spending the same resources I mean (at least in big-O meaning).

So the best approach - run re parser in separate thread and kill it after timeout.

In addition to the other answers, a solution may also be to roll your own regex library, that allows performance instrumentation during execution, and thus providing the means to kill the execution halfway through if some criteria is met.

Similarly, you could run the regexes on another thread and kill the threads if they take too long.

Licensed under: CC-BY-SA with attribution
scroll top