The key difference in the matching path between .+?> and [^>]+ is that in order to match the current character, .+?> has to look at the following character, whereas [^>]+ does not. [^>]+ means "match one or more characters that are not a >", and it will just eat them up without giving it a second thought.
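As a quick check that the two approaches find the same text, here is a minimal sketch using Python's re module. The sample string is an assumption chosen for illustration; it is not from the original example.

```python
import re

# Hypothetical sample string: the first > ends the part we want.
subject = "thing> and more>"

# Lazy dot: take as few characters as possible before a >.
lazy = re.match(r".+?>", subject)

# Negated class: take every character that is not a >, then the >.
negated = re.match(r"[^>]+>", subject)

print(lazy.group())     # thing>
print(negated.group())  # thing>
```

Both patterns return the same match here; what differs is the path the engine takes to get there.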
Why does .+?> need to look ahead and cause backtracking?
In contrast, at each step, .+?> goes one step forward, then one step back. Why?
Let's say you're trying to match thing> using .+?>. At the first step, in front of the t, the + requires at least one character, so the dot matches the t. Because the ? makes the quantifier lazy, the dot then stops as early as it can, and the engine tries to match the > against the next character, h. That fails. The engine therefore backtracks, and the lazy quantifier then gets off the couch and allows the dot to match the h. The process is repeated for i, n and g: for each character, the engine first tries to match the >, fails, backtracks, and lets the dot match the letter. Only after the g does the > finally match.
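The forward-then-back rhythm can be sketched with a small hand-written simulation. This is only an illustration of the idea, not how any real engine is implemented: the function name trace_lazy_dot_plus is hypothetical, the match is anchored at the start of the subject, and the dot is assumed to accept any character (newlines are ignored for simplicity).

```python
def trace_lazy_dot_plus(subject, closer=">"):
    """Simulate .+?> anchored at the start of subject.

    Returns (matched_text_or_None, list_of_step_descriptions).
    Simplification: the dot accepts any character.
    """
    steps = []
    if not subject:
        return None, steps

    # The + forces the dot to take one character before anything else.
    steps.append(f"dot takes {subject[0]!r}")
    pos = 1

    while pos < len(subject):
        # Lazy: before taking more, try the token that follows the dot.
        if subject[pos] == closer:
            steps.append(f"{closer!r} matches at index {pos}")
            return subject[:pos + 1], steps
        steps.append(f"{closer!r} fails against {subject[pos]!r}, backtrack")
        # After backtracking, the dot reluctantly takes one more character.
        steps.append(f"dot takes {subject[pos]!r}")
        pos += 1

    steps.append(f"{closer!r} fails at end of subject")
    return None, steps


match, steps = trace_lazy_dot_plus("thing>")
for step in steps:
    print(step)
print("match:", match)
```

For thing>, the trace shows four failed attempts to match the > (against h, i, n and g), each followed by a backtrack, before the > finally matches.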
This is clearly shown in the RegexBuddy debugger, where RB tries to match thing> using .+?>.
Compare that with this screenshot, where RB tries to match thing> using [^>]+>.
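For the [^>]+> side of the comparison, a rough simulation looks like this. As before, this is a sketch under stated assumptions rather than a real engine: the function name trace_negated_class is hypothetical, the match is anchored at the start of the subject, and the backtracking a real engine would do when no > exists at all is not modeled.

```python
def trace_negated_class(subject, closer=">"):
    """Simulate [^>]+> anchored at the start of subject.

    Returns (matched_text_or_None, characters_consumed_by_the_class).
    """
    pos = 0
    # The greedy class consumes every non-closer character in one pass,
    # without ever testing the token that follows it.
    while pos < len(subject) and subject[pos] != closer:
        pos += 1

    # The + needs at least one character, and a closer must follow.
    if pos == 0 or pos == len(subject):
        return None, pos

    # The class stopped exactly on the closer, so > matches on the first try.
    return subject[:pos + 1], pos


match, consumed = trace_negated_class("thing>")
print(match, consumed)
```

For thing>, the class consumes five characters in a single pass and the > matches immediately, with no failed attempts along the way.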