Question

I have a very large file that looks like this (see the sample below). I have two basic choices of regex to use on it (I know there may be others, but I'm really trying to compare the Greedy and Negated Char Class methods).

ftp: [^\D]{1,}
ftp: (\d)+
ftp: \d+

Note: what if I took off the parens around the \d?

Now + is greedy, which forces backtracking, but the Negated Char Class requires a char-by-char comparison. Which is more efficient? Assume the file is very, very large, so minute differences in processor usage will be magnified by the length of the file.

Now that you've answered that: what if my Negated Char Class were very large, say 18 different characters? Would that change your answer?

Thanks.

ftp: 1117 bytes
ftp: 5696 bytes
ftp: 3207 bytes
ftp: 5696 bytes
ftp: 7200 bytes


Solution

Both your expressions have the same greediness. As others have said here, they will execute in the same way except for the capturing group.

Also, in this case greediness won't matter much for execution speed, since nothing follows the \d+. The expression will simply consume all the digits it can find and stop when the space is encountered. No backtracking should occur with these expressions.

To make it more explicit, backtracking would occur if you had an expression like this:

\d*123

In this case the parser will consume all the digits, then backtrack until the literal 123 can match.
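
If you want to watch this happen, Perl's re pragma can dump the engine's steps (the exact output varies between Perl versions); here is a minimal sketch, using a made-up line that forces the backtrack:

use strict;
use warnings;
use re 'debug';    # dumps the engine's compile and match steps to STDERR

# \d* first swallows all of 5696123, fails to find a literal 123 after it,
# then gives characters back until 123 can match.
my $matched = "ftp: 5696123 bytes" =~ /\d*123/;
print $matched ? "matched\n" : "no match\n";

Run the same thing with /\d+/ against one of the sample lines and the trace should show no give-back steps.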

OTHER TIPS

[^\D]{1,} and \d+ are exactly the same. The regex parser will compile [^\D] and \d into character classes with identical content, and + is just shorthand for {1,}.

If you want lazy repetition you can add a ? after the quantifier:

\d+?
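
To see what that changes in practice, here is a small sketch against one of the sample lines from the question; the only difference is how much the group captures:

use strict;
use warnings;

my $line = "ftp: 1117 bytes";
my ($greedy) = $line =~ /ftp: (\d+)/;     # captures "1117"
my ($lazy)   = $line =~ /ftp: (\d+?)/;    # captures "1" -- stops as soon as it can
print "$greedy $lazy\n";                  # prints "1117 1"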

Character classes are usually compiled into bitmaps for ASCII characters. For Unicode (code points >= 256) it is implementation dependent. One approach is to build a list of ranges and use binary search on it.

For ASCII, the lookup time is constant regardless of the class size. For Unicode it is logarithmic or linear.
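
As a rough illustration of the range-list idea (this is not how any particular engine actually implements it, and the ranges below are just example data), membership can be tested with a binary search over sorted, non-overlapping code point ranges:

use strict;
use warnings;

# Example ranges: ASCII digits and Arabic-Indic digits, as [start, end] pairs.
my @ranges = ( [0x0030, 0x0039], [0x0660, 0x0669] );

sub in_class {
    my ($cp) = @_;
    my ($lo, $hi) = (0, $#ranges);
    while ($lo <= $hi) {
        my $mid = int( ($lo + $hi) / 2 );
        if    ($cp < $ranges[$mid][0]) { $hi = $mid - 1 }
        elsif ($cp > $ranges[$mid][1]) { $lo = $mid + 1 }
        else                           { return 1 }
    }
    return 0;
}

print in_class( ord('7') ), "\n";   # 1
print in_class( ord('x') ), "\n";   # 0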

This is kind of a trick question as written, because (\d)+ takes slightly longer due to the overhead of the capturing parentheses. If you change it to \d+, they take the same amount of time on my Perl/system.
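
If you want to reproduce that comparison yourself, the core Benchmark module is enough. This is only a sketch: the sample lines are repeated to make the work measurable, and the numbers will depend entirely on your Perl build and machine.

use strict;
use warnings;
use Benchmark qw(cmpthese);

my @lines = ( "ftp: 1117 bytes", "ftp: 5696 bytes", "ftp: 3207 bytes" ) x 10_000;

cmpthese( -3, {
    'capturing'     => sub { my $n = 0; $n++ for grep { /ftp: (\d)+/ } @lines; },
    'non-capturing' => sub { my $n = 0; $n++ for grep { /ftp: \d+/ }  @lines; },
} );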

Yeah, I agree with MizardX... these two expressions are semantically equivalent, although the grouping could require additional resources. That's not what you were asking about, though.

My initial tests show that [^\D]{1,} is a bit slower than \d+: on a 184 MB file the former takes 9.6 seconds while the latter takes 8.2 seconds.

Without the capturing ()'s, both are about 1 second faster, but the difference between the two stays about the same.

I also did a more extensive test where the captured value is printed to /dev/null, with a third version splitting on the space. Results:

([^\D]{1,}): ~18s
(\d+): ~17s
(split / /)[1]: ~17s

Edit: the split version has been improved and its time decreased to be the same as or lower than (\d+).
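
The regex-based variant is along these lines (a sketch; the exact loop actually timed may differ). Run it as something like perl test.pl bigfile > /dev/null so the printing cost is the same for every variant:

use strict;
use warnings;

while (<>) {
    if ( /ftp: (\d+)/ ) {        # swap in ([^\D]{1,}) for the other variant
        print "$1\n";
    }
}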

Fastest version so far (can anyone improve?):

while (<>)
{
    # second space-separated field is the byte count
    if (my $foo = (split / /)[1])
    {
        print $foo . "\n";
    }
}

Not a direct answer to the question, but why not take a different approach altogether, since you already know the format of the lines? For example, you could match on the whitespace between the fields, or avoid regex altogether and split() on the whitespace, which is generally going to be faster than any regular expression, depending on the language you're using.

Licensed under: CC-BY-SA with attribution