Question

While building a simple regex I've found it to have a quite strange performance profile while input size increased.

Here is another really basic regex that has a similar behavior:

a+b

I've profiled it with a simple benchmark:

Regex regex = new Regex("a+b", RegexOptions.Compiled);

const int maxInputSize = 100;
const int n = 1000;

string input = "";
Stopwatch stopwatch = new Stopwatch();
for (int inputSize = 1; inputSize <= maxInputSize; ++inputSize)
{
    input += 'a';

    stopwatch.Restart();
    for (int i = 0; i < n; ++i)
    {
        regex.Match(input);
    }
    stopwatch.Stop();

    Console.WriteLine(stopwatch.Elapsed.Ticks);
}

It runs the regex on the strings "a", "aa", "aaa", ... and measure the time it took for each length of the string to do n matches.

I'm aware of the backtracking issue (e.g. if the regex was something like (a+a+)+b), but in this case even considering backtracking I expected a linear complexity.

As an example if we want to match n times 'a' here is the workflow I naively expected:

take first 'a'
take second 'a'
...
take last 'a'
ooops nothing more to take => backtracking
release one 'a' and try to match 'b', nothing => backtracking
...
release second 'a' and retry to match 'b', nothing => backtracking
release first 'a'
ooops we're back at the beginning => no match

So it should execute something like 2n operations.

(This documentation seems to confirm that complexity should be linear: http://msdn.microsoft.com/en-us/library/dsy130b4.aspx)

But instead I observe a quadratic complexity:

enter image description here

So my questions are:

  • 1) were my expectations about linear complexity unreasonable?
  • 2) if yes, what am I missing about regex matching?
  • 3) if no, is my benchmark flawed? why?
  • 4) if my expectations and the benchmark are correct what could be the issue(s)?

Thanks in advance for any input.

Was it helpful?

Solution

The Regex.Match function searches for a substring match: the engine tries to match the expression starting at any index of the string, giving an O(n²) algorithm. You can probably achieve linear performance by anchoring the regex to the start of the string:

Regex regex = new Regex("^a+b$", RegexOptions.Compiled);
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top