Question

While running some tests for this answer, I noticed the following unexpected behavior. This will remove all occurrences of <tag> after the first:

var input = "<text><text>extra<words><text><words><something>";
Regex.Replace(input, @"(<[^>]+>)(?<=\1.*\1)", "");
// <text>extra<words><something>

But this will not:

Regex.Replace(input, @"(?<=\1.*)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>

Similarly, this will remove all occurrences of <tag> before the last:

Regex.Replace(input, @"(<[^>]+>)(?=.*\1)", "");
// extra<text><words><something>

But this will not:

Regex.Replace(input, @"(?=\1.*\1)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>

So this got me thinking…

In the .NET regular expression engine, does a backreference need to appear after the group it's referencing? Or is there something else going on with these patterns that's causing them not to work?


Solution

Your question got me thinking too, so I ran a few tests in RegexBuddy. To my surprise, the second regex, (?<=\1.*)(<[^>]+>), which you said didn't work, actually did work there, while the others behaved exactly as you described. I then tried that same second expression in C# code, and it failed to match, just as it did for you.

This confused me until I noticed that my RegexBuddy version dates back to 2008, so something may have changed in how the .NET engine works since then. It also suggests something I find reasonable: before 2008, lookbehinds seem to have been evaluated after the rest of the expression had matched. That behavior is somewhat defensible for lookbehinds, since you need to match something before you can look behind it.

Nevertheless, current engines seem to evaluate a lookaround at the point where they encounter it. I was able to confirm this with the following expression, which is the reverse of your case:

(?<=(\w))\1

As you can see, I captured a word character inside the lookbehind and referenced it outside of it. I tested this on the string hello, and it matched at the second l, as expected. This shows that the lookbehind was executed before the engine attempted to match the rest of the expression.
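That test can be reproduced with a minimal sketch like the one below. It assumes a .NET 5 or later console project (so that top-level statements are available), and the variable name m is only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// The lookbehind captures the word character just before the current
// position; \1 then has to match that same character again.
Match m = Regex.Match("hello", @"(?<=(\w))\1");

Console.WriteLine(m.Success); // True
Console.WriteLine(m.Index);   // 3, i.e. the second 'l'
Console.WriteLine(m.Value);   // l
```

If the lookbehind were evaluated only after the rest of the expression, \1 would be tried while group 1 was still empty, and no match would be found.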

Conclusion: Yes, a backreference needs to appear after the group it references. Otherwise the group has not captured anything when the backreference is evaluated, so the backreference cannot match.
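To see this conclusion against the four patterns from the question, here is a small sketch that runs them side by side. It again assumes a .NET 5 or later console project with top-level statements; the variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<text><text>extra<words><text><words><something>";

// Group 1 is captured before \1 is evaluated, so both of these match.
string afterFirst = Regex.Replace(input, @"(<[^>]+>)(?<=\1.*\1)", "");
string beforeLast = Regex.Replace(input, @"(<[^>]+>)(?=.*\1)", "");

// Here \1 is evaluated while group 1 is still empty, so nothing matches
// and the input comes back unchanged.
string lookbehindFirst = Regex.Replace(input, @"(?<=\1.*)(<[^>]+>)", "");
string lookaheadFirst = Regex.Replace(input, @"(?=\1.*\1)(<[^>]+>)", "");

Console.WriteLine(afterFirst);               // <text>extra<words><something>
Console.WriteLine(beforeLast);               // extra<text><words><something>
Console.WriteLine(lookbehindFirst == input); // True
Console.WriteLine(lookaheadFirst == input);  // True
```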

Licensed under: CC-BY-SA with attribution