how does using a negated character class work internally (without backtracking)?

StackOverflow https://stackoverflow.com/questions/23662454

  •  22-07-2023
  •  | 
  •  

Question

I read that to stop backtracking in regex a negated character class can be used. Like if we want to match <Em> in This <Em>is the shiz <Em>, we can use <[^>]+> which is faster than <.+?> because the later backtracks after each character but the former does not back track at all.
Can someone please explain how <[^>]+> matches internally?

Was it helpful?

Solution 2

The key difference in the matching path between .+?> is that in order to match the current character, .+?> has to look at the following character, where as [^>]+ does not. [^>]+ means "match one or more characters that are not a >... and it will just eat them up without giving it a second thought.

Why does the .+?> need to look ahead and cause backtracking?

In contrast, at each step, the .+?> goes one step forward then one step back. Why?

Let's say you're trying to match thing> using .+?>. At the first step, in front of the t, because the ? is lazy, the dot in .+?> matches zero characters. The engine then advances to the next character. There, it tries to match the >, but fails. The engine therefore backtracks, and the lazy quantifier then gets off the couch and allows the dot to match. The process is repeated for h, i, n and g: for each character, the lazy dot first matches zero characters; then the engine tries to match the >, fails, backtracks, and matches the letter.

This is clearly shown in the RegexBuddy debugger where RB tries to match thing> using .+?>

RegexBuddy debugger

Compare that with this screenshot where RB tries to match thing> using [^>]+>

RB debugger

OTHER TIPS

You must first understand how work a greedy and a lazy quantifier.

A lazy quantifier will test on each character if the subpattern that follows (> in your example) matches. A greedy quantifier will take all possible characters and only after, if needed, will backtrack to make the next subpattern (> in your example) match.

But if instead of the dot (that matches all except newlines) you use a negated character class that doesn't contain the character >, you are sure to not have the steps of backtracking. I hope to be clear!

To illustrate what I say, I suggest you to try these three patterns with the debugger of http://regex101.com : <.*>, <.*?>, <[^>]*> With this string : <abcd efgh="ijkl" mnop="qrst"> lapin

About PCRE in particular: The PCRE lib is compiled by default with automatic internal optimisations. If you use a negated character class followed by the excluded character, the quantifier is automatically converted to a possessive quantifier. This feature can only be changed at compile time. (source). It is probably the same with Perl.

Can someone please explain how <[^>]+> matches internally?

Actually, that's the easier one to understand. It says:

Match the sequence
1) one   '<'
2) at least one (as many as posible) '[^>]'  (i.e.: any character except for '>')
3) one '>'

Then the matcher, searching at This <Em>is the shiz <Em> will do what one expects. Starting from the beginning it will (tentatively) match the first < , then go for the part 2, and match the Em and then go for part 3) ... and find the > : done!

With <.*?> the instruction is:

Match the sequence
1) one   '<'
2) any amount (least possible) of '.' (i.e.: any character)
3) one '>'

Now, starting from the beginning it will (tentatively) match the first < , then go for the part 2, and (here the first difference) it will first say, "hey, I found a empty string (that one between the < and the E), it matches step 2, let's go for step 3". But then it will fail ("I found a E, I expected a >); then it will go back in the pattern sequence (but not in the input sequence! I wouldn't call this backtracking, just branching looking ahead), ok, let's try matching E for step 2) (it matches, go on ...) it will fail again... but in the next repetition it will succeed. So this functionally equivalente and slightly less efficient.

Be aware the alternatives are not functionally equivalent if the pattern does not end there. For example, if the pattern were, say: <.*?><br> and I give it the input hi <em>hello ! <em><br>bye it will match <em>hello ! <em><br> which is not what you probably want.

a+ keeps matching characters until it reaches one that's not a.

[a-z]+ keeps matching characters until it reaches one that's not a, z, or something in between.

[^a-z]+ keeps matching characters until it reaches one that is a, z, or something in between. Negated character classes and regular character classes work exactly the same as far as the matcher is concerned; it's just that at regex construction time, the inversion list is flipped so that instead of only matching inside the specified ranges, it only matches outside the specified ranges.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top