Frage

Some context

I'm usually using the website http://regex101.com to test my regex, which provides a "debugger" feature in PCRE that lets you see what the regex engine is doing step by step.

When matching a random string with .*, this debugger tells me the engine follows the constant number of 3 steps.

When matching with (?:.)*, it announces a number depending on the length: 66 steps for something like 0123456789012345678901234567899.

Is (?:.)* really more costly than .*?

It seems that on the latter case, entering the group is considered each time to be a new step, whereas on the former the .* is applied at once.

Is that some sort of "improvement" the website is doing (trying to avoid showing useless cases), or does it match a real internal regex mechanism ? And if so, what's the idea behind the scene?

War es hilfreich?

Lösung

You didn't specify which engine. You mentioned PCRE, but you also tagged with Perl.


regex101 site shows that PCRE handles .* using one operation, but that doesn't mean that one operation is faster than the operations generated by the equivalent (?:.)*. Only benchmarking will tell, but .* will likely be marginally faster due to less overhead.


In Perl, they compile to exactly the same regex program (as you can see below), so they will perform identically.

>perl -Mre=debug -e"'0123456789012345678901234567899' =~ /.*/"
Compiling REx ".*"
Final program:
   1: STAR (3)
   2:   REG_ANY (0)
   3: END (0)
anchored(MBOL) implicit minlen 0
Matching REx ".*" against "0123456789012345678901234567899"
   0 <> <0123456789>         |  1:STAR(3)
                                  REG_ANY can match 31 times out of 2147483647...
  31 <901234567899> <>       |  3:  END(0)
Match successful!
Freeing REx: ".*"

>perl -Mre=debug -e"'0123456789012345678901234567899' =~ /(?:.)*/"
Compiling REx "(?:.)*"
Final program:
   1: STAR (3)
   2:   REG_ANY (0)
   3: END (0)
anchored(MBOL) implicit minlen 0
Matching REx "(?:.)*" against "0123456789012345678901234567899"
   0 <> <0123456789>         |  1:STAR(3)
                                  REG_ANY can match 31 times out of 2147483647...
  31 <901234567899> <>       |  3:  END(0)
Match successful!
Freeing REx: "(?:.)*"

In both cases, the string is scanned for characters than aren't newlines, and that's it.


Note that no matter how many "steps" are taken, this cannot be done in constant time. . doesn't match newlines (without /s), so the regex engine must check each character it's about to match to see whether it's a newline or not.

Andere Tipps

You can use pcretest for seeing the differences. Here is a nice tutorial.

Your first example obviously need few steps, then the second. On the left side with the + you see the position in the pattern, on the right side the matching position in the input.

1.) /.*/CD on str 0123456789012345678901234567899 with debugging modifiers CD

pcretest1

2.) /(?:.)*/CD same str

pcretest2

And that is, what really happens.

I have written a small benchmark code to test this scenario. And your both regex returns almost the same time on performance. So not sure about which one is better.

However, I have changed your changed your regex (?:.)* into (.)* and it drastically reduce the performance. I believe it is because of the group capturing. Here is the code:

use Benchmark qw( cmpthese );

cmpthese(-3, {    
    '.*'     => '"kasdaskdhas dhaskdh askdhqwioeuweakjsdhasjdk asjdk ask" =~ /.*/',
    '(?:.)*' => '"kasdaskdhas dhaskdh askdhqwioeuweakjsdhasjdk asjdk ask" =~ /(?:.)*/',
    '(.)*'   => '"kasdaskdhas dhaskdh askdhqwioeuweakjsdhasjdk asjdk ask" =~ /(.)*/',
});

Outputs:

            Rate   (.)* (?:.)*     .*
(.)*   2305921/s     --   -34%   -35%
(?:.)* 3499870/s    52%     --    -1%
.*     3524871/s    53%     1%     --

That 1% difference between .* and (?:.)* is noise and meaningless.

The website doesn't seem to explain what those steps are!

This has to be an optimisation issue. The question "Is (?:.)* really more costly than .*?" depends on the regex engine in use, and the site is very unlikely to use Perl's regex engine which is built into the perl compiler/interpreter. The optimisation in whatever it does use has chosen to ignore trivial cases like (?:.)* that are unlikely in the real world.

If you need your regex to run faster then you should use Benchmark to compare different patterns, or perhaps Regexp::Optimizer, which will attempt to rewrite your pattern into an equivalent faster one, or Regexp::Debugger which will allow you to see what is going on behind the scenes.

But please don't take these measures until you have written a functional and clear program that doesn't perform fast enough, and have proven that the bottleneck is the regex matching. The regex engine is wholly written in C and you are unlikely to make a big difference to the overall speed of your code by changing the regex patterns that it uses.

I'm not an expert on the subject, but from what I can tell, yes, /(?:.)*/ and /(.)*/ are more costly than /.*/.

According to the Perl documentation on backtracking,

A fundamental feature of regular expression matching involves the notion called backtracking, which is currently used (when needed) by all regular non-possessive expression quantifiers, namely * , *? , + , +?, {n,m}, and {n,m}?. Backtracking is often optimized internally, but the general principle outlined here is valid.

So basically, .* is optimized internally, but I can't find a source that says how.

I also found another source, a blog post by the author of Mastering Regular Expressions, Jeffrey Friedl.

By the way, I guess I should make one mention about how Perl sometimes optimizes how it deals with regular expressions. Sometimes it might actually perform fewer tests than what I've described. Or it perhaps does some tests more efficiently than others (for example, /x*/ is internally optimized such that it is more efficient than /(x)/ or /(?:x)/). Sometimes Perl can even decide that a regex can never match the particular string in question, so will bypass the test altogether.

If anyone else can explain the optimizations Perl makes in more detail, that would be useful!

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top