Question

Got a simple task to get a XPath expression and return a prefix that matches the parent of the node that (might be) selected.

Example:

/aaa/bbb       =>   /aaa
/aaa/bbb/ccc   =>   /aaa/bbb
/aaa/bbb/ccc[@x='1' and @y="/aaa[name='z']"] => /aaa/bbb

Because the patterns inside the square brackets might contain brackets within quotes, I decided to try to achieve this with the use of regular expressions. Here's a code snippet:

string input =
    "/aaa/bbb/ccc[@x='1' and @y=\"/aaa[name='z'] \"]";
                                            //  ^-- remove space for no loop
string pattern = @"/[a-zA-Z0-9]+(\[([^]]*(]"")?)+])?$";

System.Text.RegularExpressions.Regex re =
    new System.Text.RegularExpressions.Regex(pattern);
bool ismatch = re.IsMatch(input); // <== Infinite loop in here
// some code based on the match

Because the patterns are rather regular, I looked for '/' followed by indentifier followed by an optional group that matches at the end of the string (....)?$

The code seemd to work but playing with different values for the input string, I found that by simply inserting a space (in the location shown in the comment), the .NET IsMatch function gets into an infinite loop, taking all the CPU it gets.

Now regardless of whether this regular expression pattern is the best one (I had more complex but simplified it to show the problem), this seems to show that using RegEx with anything not trivial may be very risky.

Am I missing something? Is there a way to guard against infinite loops in regular expression matches?

Was it helpful?

Solution

Ok, let's break this down then:

Input: /aaa/bbb/ccc[@x='1' and @y="/aaa[name='z'] "]
Pattern: /[a-zA-Z0-9]+(\[([^]]*(]")?)+])?$

(I assume you meant \" in your C#-escaped string, not ""... translation from VB.NET?)

First, /[a-zA-Z0-9]+ will gobble up through the first square bracket, leaving:

Input: [@x='1' and @y="/aaa[name='z'] "]

The outer group of (\[([^]]*(]"")?)+])?$" should match if there is 0 or 1 instance before the EOL. So let's break inside and see if it matches anything.

The "[" gets gobbled right away, leaving us with:

Input: @x='1' and @y="/aaa[name='z'] "]
Pattern: ([^]]*(]")?)+]

Breaking down the pattern: match 0 or more non-] characters and then match "] 0 or 1 times, and keep doing this until you can't. Then try to find and gobble a ] afterward.

The pattern matches based on [^]]* until it reaches the ].

Since there's a space between ] and ", it can't gobble either of those characters, but the ? after (]") allows it to return true anyway.

Now we've successfully matched ([^]]*(]")?) once, but the + says we should attempt to keep matching it any number of times we can.

This leaves us with:

Input: ] "]

The problem here is that this input can match ([^]]*(]")?) an infinite of times without ever being gobbled up, and "+" will force it to just keep trying.

You're essentially matching "1 or more" situations where you can match "0 or 1" of something followed by "0 or 1" of something else. Since neither of the two subpatterns exists in the remaining input, it keeps matching 0 of [^]]\* and 0 of (]")? in an endless loop.

The input never gets gobbled, and the rest of the pattern after the "+" never gets evaluated.

(Hopefully I got the SO-escape-of-regex-escape right above.)

OTHER TIPS

The problem here is that this input can match ([^]]*(]")?) an infinite of times without ever being gobbled up, and "+" will force it to just keep trying.

That's one hell of a bug in .NET's RegEx implementation. Regular expressions just don't work like that. When you turn them into automata, you automatically get the fact that an infinite repetition of an empty string is still an empty string.

In other words, any non-buggy regex engine will execute this infinite loop instantly and continue with the rest of the regex.

If you prefer, regular expressions are such a limited language that it's possible (and easy) to detect and avoid such infinite loops.

It shows that using code with anything not trivial can be risky. You created code that can result in an infinite loop, and the RegEx compiler obliged. Nothing new that hasn't been done since the first 20 IF X=0 THEN GOTO 10.

If you're worried about this in a particular edge case, you could spawn a thread for RegEx and then kill it after some reasonable execution time.

To answer the original question (i.e. how to avoid infinite loop with regex), this has become easy with .Net 4.5 as you can simply pass a time out to the Regex methods. There is an internal timer that will stop the regex loop when the timeout expires and raise a RegexMatchTimeoutException

For example, you would do the following

string input = "/aaa/bbb/ccc[@x='1' and @y=\"/aaa[name='z'] \"]";
string pattern = @"/[a-zA-Z0-9]+(\[([^]]*(]"")?)+])?$";
bool ismatch = Regex.IsMatch(input, pattern, RegexOptions.None, TimeSpan.FromSeconds(5));

You can check out MSDN for more details

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top