Question

Given the following regular expression and subject text, why is the negative lookahead only applying to the last character of the named capture group URL?

// Regex
(?<URL>(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*)(?!'|"|(</a))

// Subject text
<p><a href="http://example.com">http://example.com</a> and http://example.com</p>

This regex has a negative lookahead (?!'|"|(</a)), which is an attempt to avoid matching URLs that are inside an <a> tag. It checks whether the URL is followed by a quote (' or ") or a closing </a tag.

I'm getting the following results:

http://example.co
http://example.co
http://example.com

I expected the negative lookahead to apply to the whole capture group, not just its last character. Is this possible? What am I doing wrong? I expected only the last instance of http://example.com to be captured.


Solution

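Because when the negative lookahead fails, the quantifiers (and anything else that can) backtrack until the engine finds a match. Here, [\w.:@]+ gives up the final m, so the URL ends at http://example.co; the next character is then m, which is neither a quote nor the start of </a, and the lookahead succeeds.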
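The backtracking can be reproduced with a short sketch. Python is an assumption here (the question does not name a regex flavor), so the named groups are rewritten in Python's (?P<name>...) syntax; the pattern is otherwise the one from the question.

```python
import re

# The question's pattern; only the named-group syntax is changed for Python.
PATTERN = re.compile(
    r"""(?P<URL>(?P<Protocol>\w+):\/\/(?P<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*)(?!'|"|(</a))"""
)

SUBJECT = '<p><a href="http://example.com">http://example.com</a> and http://example.com</p>'

# For the first two URLs the lookahead fails after the full URL, so the
# Domain quantifier gives back the trailing "m" and the lookahead is retried.
matches = [m.group("URL") for m in PATTERN.finditer(SUBJECT)]
print(matches)
# ['http://example.co', 'http://example.co', 'http://example.com']
```

The first two results are one character short because the lookahead is only ever tested at the position where the group currently ends, and that position moves as the quantifiers give characters back.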

You can force an expression not to backtrack by using atomic groups (?>expression):

(?<URL>(?>(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*))(?!'|"|(</a))
Licensed under: CC-BY-SA with attribution