While building a simple regex, I noticed a rather strange performance profile as the input size increased.
Here is an even more basic regex that exhibits similar behavior:
a+b
I've profiled it with a simple benchmark:
Regex regex = new Regex("a+b", RegexOptions.Compiled);
const int maxInputSize = 100;
const int n = 1000;
string input = "";
Stopwatch stopwatch = new Stopwatch();
for (int inputSize = 1; inputSize <= maxInputSize; ++inputSize)
{
    input += 'a';
    stopwatch.Restart();
    for (int i = 0; i < n; ++i)
    {
        regex.Match(input);
    }
    stopwatch.Stop();
    Console.WriteLine(stopwatch.Elapsed.Ticks);
}
It runs the regex on the strings "a", "aa", "aaa", ... and measures, for each string length, the time it takes to perform n matches.
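For eyeballing complexity, a variant I find helpful (a sketch, keeping the same `n` and `maxInputSize`; the warm-up line is my addition, meant to keep first-run JIT cost out of the early, short-input measurements) prints ticks divided by the input size, which should stay roughly flat for linear cost and grow for anything worse:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class Benchmark
{
    static void Main()
    {
        Regex regex = new Regex("a+b", RegexOptions.Compiled);
        const int maxInputSize = 100;
        const int n = 1000;
        string input = "";
        Stopwatch stopwatch = new Stopwatch();
        regex.Match("a"); // warm-up: JIT-compile the regex before timing

        for (int inputSize = 1; inputSize <= maxInputSize; ++inputSize)
        {
            input += 'a';
            stopwatch.Restart();
            for (int i = 0; i < n; ++i)
            {
                regex.Match(input);
            }
            stopwatch.Stop();
            // ticks per input character: ~constant if linear, growing if quadratic
            Console.WriteLine($"{inputSize}\t{(double)stopwatch.Elapsed.Ticks / inputSize:F1}");
        }
    }
}
```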
I'm aware of the catastrophic backtracking issue (e.g. if the regex were something like (a+a+)+b), but in this case, even accounting for backtracking, I expected linear complexity.
As an example, here is the workflow I naively expected when matching against a string of n 'a' characters:
take the first 'a'
take the second 'a'
...
take the last 'a'
oops, nothing more to take => backtrack
release one 'a' and try to match 'b'; nothing there => backtrack
...
release the second 'a' and retry 'b'; still nothing => backtrack
release the first 'a'
oops, we're back at the beginning => no match
So it should execute something like 2n operations.
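The step count I expected can be written down as a tiny simulation (hypothetical; it simply mirrors the steps listed above, assuming a single anchored attempt at position 0):

```csharp
using System;

class NaiveMatchSteps
{
    // Count elementary steps for one attempt of "a+b" at position 0
    // against a string of n 'a's: greedily consume each 'a', then
    // backtrack one 'a' at a time, testing for 'b' (which never matches).
    static long Steps(int n)
    {
        long steps = 0;
        for (int consumed = 0; consumed < n; ++consumed) steps++; // a+ consumes
        for (int consumed = n; consumed > 0; --consumed) steps++; // backtracking
        return steps; // == 2 * n, i.e. linear in the input size
    }

    static void Main()
    {
        Console.WriteLine(Steps(100)); // prints 200
    }
}
```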
(This documentation seems to confirm that complexity should be linear: http://msdn.microsoft.com/en-us/library/dsy130b4.aspx)
But instead, the timings I observe grow quadratically with the input size.
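To put a number on "quadratic": given two measurements, the growth exponent can be estimated with a log ratio (a sketch; the sample values below are synthetic, not my actual timings). A result near 2 means quadratic, near 1 means linear:

```csharp
using System;

class GrowthCheck
{
    // Assuming time(L) ~ c * L^k, two measurements give:
    // k = log(t2/t1) / log(L2/L1)
    static double Exponent(double l1, double t1, double l2, double t2)
        => Math.Log(t2 / t1) / Math.Log(l2 / l1);

    static void Main()
    {
        // Synthetic quadratic data: t = L^2
        Console.WriteLine(Exponent(50, 2500, 100, 10000)); // prints 2
    }
}
```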
So my questions are:
- 1) Were my expectations of linear complexity unreasonable?
- 2) If yes, what am I missing about regex matching?
- 3) If no, is my benchmark flawed? If so, why?
- 4) If both my expectations and the benchmark are correct, what could the issue(s) be?
Thanks in advance for any input.