Question

This is just a question to satisfy my curiosity, but I find it interesting.

I wrote this simple little benchmark. It runs three variants of Regex execution in random order a few thousand times.

All three use the same pattern, but in different ways:

  1. The ordinary way, without any RegexOptions. Starting with .NET 2.0 such instances are not cached automatically, but this one is effectively "cached" because it is held in a static field and never re-created.

  2. With RegexOptions.Compiled

  3. With a call to the static Regex.Match(input, pattern), which does get cached in .NET 2.0

Here is the code:

static List<string> Strings = new List<string>();        
static string pattern = ".*_([0-9]+)\\.([^\\.])$";

static Regex Rex = new Regex(pattern);
static Regex RexCompiled = new Regex(pattern, RegexOptions.Compiled);

static Random Rand = new Random(123);

static Stopwatch S1 = new Stopwatch();
static Stopwatch S2 = new Stopwatch();
static Stopwatch S3 = new Stopwatch();

static void Main()
{
  int k = 0;
  int c = 0;
  int c1 = 0;
  int c2 = 0;
  int c3 = 0;

  for (int i = 0; i < 50; i++)
  {
    Strings.Add("file_"  + Rand.Next().ToString() + ".ext");
  }
  int m = 10000;
  for (int j = 0; j < m; j++)
  {
    c = Rand.Next(1, 4);

    if (c == 1)
    {
      c1++;
      k = 0;
      S1.Start();
      foreach (var item in Strings)
      {
        var m1 = Rex.Match(item);
        if (m1.Success) { k++; }
      }
      S1.Stop();
    }
    else if (c == 2)
    {
      c2++;
      k = 0;
      S2.Start();
      foreach (var item in Strings)
      {
        var m2 = RexCompiled.Match(item);
        if (m2.Success) { k++; }
      }
      S2.Stop();
    }
    else if (c == 3)
    {
      c3++;
      k = 0;
      S3.Start();
      foreach (var item in Strings)
      {
        var m3 = Regex.Match(item, pattern);
        if (m3.Success) { k++; }
      }
      S3.Stop();
    }
  }

  Console.WriteLine("c: {0}", c1);
  Console.WriteLine("Total milliseconds: " + (S1.Elapsed.TotalMilliseconds).ToString());
  Console.WriteLine("Adjusted milliseconds: " + (S1.Elapsed.TotalMilliseconds).ToString());

  Console.WriteLine("c: {0}", c2);
  Console.WriteLine("Total milliseconds: " + (S2.Elapsed.TotalMilliseconds).ToString());
  Console.WriteLine("Adjusted milliseconds: " + (S2.Elapsed.TotalMilliseconds*((float)c2/(float)c1)).ToString());

  Console.WriteLine("c: {0}", c3);
  Console.WriteLine("Total milliseconds: " + (S3.Elapsed.TotalMilliseconds).ToString());
  Console.WriteLine("Adjusted milliseconds: " + (S3.Elapsed.TotalMilliseconds*((float)c3/(float)c1)).ToString());
}

Every time I run it, the results are along the lines of:

    Not compiled and not automatically cached:
    Total milliseconds: 6185,2704
    Adjusted milliseconds: 6185,2704

    Compiled and not automatically cached:
    Total milliseconds: 2562,2519
    Adjusted milliseconds: 2551,56949184038

    Not compiled and automatically cached:
    Total milliseconds: 2378,823
    Adjusted milliseconds: 2336,3187176891

So there you have it: not much, but the compiled variant is about 7-8% slower than the cached static one.

That is not the only mystery. I cannot explain why the first variant is so much slower, since its Regex is never re-created but held in a static field.

By the way, this is on .NET 3.5 and Mono 2.2, both on Windows, and they behave exactly the same.

So, any ideas why the compiled variant would fall behind?

EDIT1:

After fixing the code (the arguments to the static Regex.Match call were swapped), the results now look like this:

    Not compiled and not automatically cached:
    Total milliseconds: 6456,5711
    Adjusted milliseconds: 6456,5711

    Compiled and not automatically cached:
    Total milliseconds: 2668,9028
    Adjusted milliseconds: 2657,77574842168

    Not compiled and automatically cached:
    Total milliseconds: 6637,5472
    Adjusted milliseconds: 6518,94897724836

That pretty much makes all of the other questions moot as well.

Thanks for the answers.

Solution

In the static Regex.Match version you are searching for the input inside the pattern: the static overload takes the input first and the pattern second. Try swapping the arguments around.

var m3 = Regex.Match(pattern, item); // Wrong
var m3 = Regex.Match(item, pattern); // Correct
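
To see the effect of the argument order in isolation, here is a minimal sketch. The pattern and input are made-up sample values, not the ones from the benchmark, and the class and method names are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class ArgumentOrderDemo
{
    // Hypothetical sample values, chosen only for this illustration.
    const string Pattern = @"_([0-9]+)\.";
    const string Input = "file_123.ext";

    // Correct order: Regex.Match(input, pattern).
    public static bool CorrectOrder()
    {
        return Regex.Match(Input, Pattern).Success;
    }

    // Swapped order: "file_123.ext" is now treated as the pattern and is
    // searched for inside the text "_([0-9]+)\.", where it does not occur.
    public static bool SwappedOrder()
    {
        return Regex.Match(Pattern, Input).Success;
    }

    static void Main()
    {
        Console.WriteLine("Correct order matched: " + CorrectOrder());
        Console.WriteLine("Swapped order matched: " + SwappedOrder());
    }
}
```

With the swapped order the call still compiles and runs, because both parameters are strings; it just fails to match, which is why the mistake only shows up in the timings.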

OTHER TIPS

I noticed similar behavior. I also wondered why the compiled version would be slower, but noticed that above a certain number of calls, the compiled version is faster. So I dug into Reflector a little, and I noticed that for a compiled Regex, there's still a little setup that is performed on first call (specifically, creating an instance of the appropriate RegexRunner object).

In my test, I found that if I moved both the constructor and an initial throw-away call to the regex outside the timer start, the compiled regex won no matter how many iterations I ran.
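
A rough sketch of that kind of measurement is below. It is not my original test code: the helper name TimeMatches, the sample input and the iteration count are assumptions made for illustration, and the pattern is copied from the question. Both Regex objects are constructed, and one throw-away Match call is made, before the Stopwatch starts.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class WarmUpBenchmark
{
    // Times `iterations` calls to regex.Match(input), after one untimed call.
    public static TimeSpan TimeMatches(Regex regex, string input, int iterations)
    {
        regex.Match(input); // throw-away call: pays any first-use setup cost

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.Match(input);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    static void Main()
    {
        // Pattern copied from the question; the input is a made-up sample.
        const string pattern = ".*_([0-9]+)\\.([^\\.])$";
        const string input = "file_12345.ext";
        const int iterations = 100000;

        // Construction happens here, outside the timed region.
        Regex interpreted = new Regex(pattern);
        Regex compiled = new Regex(pattern, RegexOptions.Compiled);

        Console.WriteLine("Interpreted: " + TimeMatches(interpreted, input, iterations).TotalMilliseconds + " ms");
        Console.WriteLine("Compiled:    " + TimeMatches(compiled, input, iterations).TotalMilliseconds + " ms");
    }
}
```

The absolute numbers will vary by machine and runtime, so treat the output as a relative comparison only.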


Incidentally, the caching that the framework is doing when using static Regex methods is an optimization that's only needed when using static Regex methods. This is because every call to a static Regex method creates a new Regex object. In the Regex class's constructor it must parse the pattern. The caching allows subsequent calls of static Regex methods to reuse the RegexTree parsed from the first call, thereby avoiding the parsing step.

When you use instance methods on a single Regex object, then this is not an issue. The parsing is still only performed one time (when you create the object). In addition, you get to avoid running all the other code in the constructor, as well as the heap allocation (and subsequent garbage collection).
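
The two styles side by side, as a small sketch (the pattern, the inputs and the method names are made up for illustration). Regex.CacheSize is the framework property that controls how many static-method patterns are kept in the cache.

```csharp
using System;
using System.Text.RegularExpressions;

static class StaticVersusInstance
{
    const string Pattern = @"_([0-9]+)\."; // hypothetical sample pattern

    // One Regex object: the pattern is parsed once, in the constructor.
    static readonly Regex Shared = new Regex(Pattern);

    public static int CountWithInstance(string[] inputs)
    {
        int hits = 0;
        foreach (string input in inputs)
        {
            if (Shared.IsMatch(input)) { hits++; }
        }
        return hits;
    }

    // Static method: relies on the framework's cache to avoid re-parsing.
    public static int CountWithStatic(string[] inputs)
    {
        int hits = 0;
        foreach (string input in inputs)
        {
            if (Regex.IsMatch(input, Pattern)) { hits++; }
        }
        return hits;
    }

    static void Main()
    {
        string[] inputs = { "file_1.ext", "file_22.ext", "no-number.ext" };
        Console.WriteLine("Static cache size: " + Regex.CacheSize);
        Console.WriteLine("Instance hits: " + CountWithInstance(inputs));
        Console.WriteLine("Static hits:   " + CountWithStatic(inputs));
    }
}
```

Both methods return the same count; the difference is only in how much work is repeated per call.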

Martin Brown noticed that you reversed the arguments to your static Regex call (good catch, Martin). I think you'll find that if you fix that, the instance (not-compiled) regex will beat the static calls every time. You should also find that, given my findings above, the compiled instance will beat the not-compiled one, too.

BUT: You should really read Jeff Atwood's post on compiled regexes before you go blindly applying that option to every regex you create.

If you constantly match the same string using the same pattern, that may explain why a cached version is slightly faster than a compiled version.

This is from the documentation:

https://msdn.microsoft.com/en-us/library/gg578045(v=vs.110).aspx

When a static regular expression method is called and the regular expression cannot be found in the cache, the regular expression engine converts the regular expression to a set of operation codes and stores them in the cache. It then converts these operation codes to MSIL so that the JIT compiler can execute them. Interpreted regular expressions reduce startup time at the cost of slower execution time. Because of this, they are best used when the regular expression is used in a small number of method calls, or if the exact number of calls to regular expression methods is unknown but is expected to be small. As the number of method calls increases, the performance gain from reduced startup time is outstripped by the slower execution speed.

In contrast to interpreted regular expressions, compiled regular expressions increase startup time but execute individual pattern-matching methods faster. As a result, the performance benefit that results from compiling the regular expression increases in proportion to the number of regular expression methods called.


To summarize, we recommend that you use interpreted regular expressions when you call regular expression methods with a specific regular expression relatively infrequently.

You should use compiled regular expressions when you call regular expression methods with a specific regular expression relatively frequently.


How to detect?

The exact threshold at which the slower execution speeds of interpreted regular expressions outweigh gains from their reduced startup time, or the threshold at which the slower startup times of compiled regular expressions outweigh gains from their faster execution speeds, is difficult to determine. It depends on a variety of factors, including the complexity of the regular expression and the specific data that it processes. To determine whether interpreted or compiled regular expressions offer the best performance for your particular application scenario, you can use the Stopwatch class to compare their execution times.
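
One way to look for that threshold with Stopwatch, as a hedged sketch: time the construction of the Regex together with a given number of calls, and repeat for increasing call counts. The helper TotalCost, the pattern, the input and the call counts are all assumptions chosen for illustration; they are not taken from the documentation.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class StartupVersusThroughput
{
    // Total cost of constructing the regex and then matching `calls` times.
    public static TimeSpan TotalCost(string pattern, RegexOptions options, string input, int calls)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Regex regex = new Regex(pattern, options); // startup cost is included on purpose
        for (int i = 0; i < calls; i++)
        {
            regex.IsMatch(input);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    static void Main()
    {
        const string pattern = @"_([0-9]+)\.";  // hypothetical sample pattern
        const string input = "file_12345.ext";  // hypothetical sample input

        foreach (int calls in new[] { 1, 100, 10000, 1000000 })
        {
            TimeSpan interpreted = TotalCost(pattern, RegexOptions.None, input, calls);
            TimeSpan compiled = TotalCost(pattern, RegexOptions.Compiled, input, calls);
            Console.WriteLine("{0,8} calls: interpreted {1:F2} ms, compiled {2:F2} ms",
                calls, interpreted.TotalMilliseconds, compiled.TotalMilliseconds);
        }
    }
}
```

The first iteration also absorbs one-time JIT costs of the regex engine itself, so it is worth running the loop more than once before drawing conclusions.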


Compiled Regular Expressions:

We recommend that you compile regular expressions to an assembly in the following situations:

  1. If you are a component developer who wants to create a library of reusable regular expressions.
  2. If you expect your regular expression's pattern-matching methods to be called an indeterminate number of times -- anywhere from once or twice to thousands or tens of thousands of times. Unlike compiled or interpreted regular expressions, regular expressions that are compiled to separate assemblies offer performance that is consistent regardless of the number of method calls.
Licensed under: CC-BY-SA with attribution