Question

This is using the .NET regex engine.

I am attempting to use a conditional inside of a lookbehind clause. When the expression is used outside of the lookbehind it behaves as I expect -- but when placed in the lookbehind, it has a different behavior.

Here is a simple example of the effect to try to duplicate the issue.

Matching:

good morning

with the regular expression:

(?<=(?(?=go)good|bad)\s)morning

yields no match.

When tried without the look behind:

(?(?=go)good|bad)\smorning

I get a match on "good morning"

By fiddling around, I discovered that the lookahead cursor location, when it is inside the lookbehind, is after the word "good":

(?<=(?(?=\smor)good|bad)\s)morning

This matches "morning".

My question is is this expected or some kind of bug?

Obviously this example is not real world - the problem that I was trying to solve when I stumbled on this issue is as follows: The expression uses a conditional to determine the length of the next word, then uses two different sets of rules for matching on that word. Similar to:

(?<=\s+(?(?=[^\s]{1,2}\s)[A-Z0-9]+|(?![A-Z]+\s)[0-9-A-Z/"']+))\s+matching\s+text

This matches the "matching text" only if a one or two letter word consisting of letters and numbers, or a longer word not consisting of only letters but can contain numbers, letters, slashes, dashes, quotes and apostrophes.

The following should match "matching text":

1 matching text
a matching text

It only matches on the first one, because the conditional evaluated to false (it was looking at the " matching" instead of "a") and the negative look ahead searching for a word consisting of all letters failed on the "a".

Further examples:

Must match "matching text":

123-1 matching text
9B matching text
15/16 matching text
"45" matching text
A matching text
AA matching text
A1 matching text

Must not match "matching text"

and matching text
" matching text
A- matching text
Was it helpful?

Solution 6

I think I understand the issue now. The important difference between a conditional inside of a lookbehind and a conditional not inside a lookbehind is the timing of when the conditional is executed (or where the search cursor is at that point.

The example without a lookbehind:

(?(?=go)good|bad)\smorning

good morning

The conditional is run at the beginning of the search. So the search cursor is before the 'g' in 'good'.

 good morning
^

So at this point the look ahead evaluates to TRUE since it sees matches on the 'go'

In the example with the lookbehind, the cursor is at a different location.

(?<=(?(?=go)good|bad)\s)morning

The search cursor finds the first required item in the text: 'morning'

good morning
     ^

The actual search cursor stays in place to consume 'morning' if it the lookbehind matches. The lookbehind will use its own cursor to verify what is before 'morning' to determine if this 'morning' is a match. The lookbehind states that there is a '\s' directly before 'morning' and indeed there is. The temporary lookbehind cursor moves to the space:

good morning
    ^^

Now it gets to the conditional and runs the lookahead in the conditional statement. At this point the lookahead is looking for 'go' but it sees ' morning'. So the conditional fails. The expression says to try to match on 'bad' (or dab backwards from the lookbehind cursor) but it sees good (or doog backwards from the lookbehind cursor). So no match for 'morning'.

Solution

Since the conditional is run at the end of the word of interest when it is in a lookbehind (instead of at the beginning of that word outside of a lookbehind), the secret is to reverse the conditional to a lookbehind instead of a lookahead:

(?<=(?(?<=good)good|bad)\s)morning

This actually matches on morning. It look nonsensical in this example to look for a word and then match it - but it illustrates the concept. In the real world case stated in the question, the solution looks like this:

(?<=(?(?<=\s\S{1,2})[A-Z0-9]+|(?![A-Z]+\s)[0-9-A-Z/"']+))\s+matching\stext

This looks for a word before matching text. The conditional checks to see if it is a one or two character word. If so, it should consist of only letters and numbers. If not, it must not only consist of letters. If this is met, it can consist of letters, numbers, dashes, slashes, quotes and apostrophes.

Two things changed:

  1. Modified the conditional to be a lookbehind instead of a lookahead
  2. I had to move the '\s' from the beginning of the lookbehind into the conditional, because the conditional is being processed before that space and it results in matching on the last one or two characters of the word instead of looking for a one or two character word. That is a tricky issue, because when the expression is not in a lookbehind (for instance if you want the text in the lookbehind to be included in the match), this change messes up the match.

What follows is some more analysis of the original problem. Enjoy!

@zx81 had an example that is actually better at illustrating what is going on. This example has more cursor movement so it does help illustrate what is happening:

    (?<=(?(?=go)good|bad))\w+pher

badphilosopher    <-- 'philosopher' matches
goodgopher        <-- 'gopher' matches
badgopher         <-- no match

There is a big difference in this example because the \w+ is used. So the regex engine immediately matches all the text in each example since the phrase has no white space and ends in 'pher'.

So for 'badphilosopher':

Lookbehind is run and the conditional is immediately run looking for 'go' but finds 'ba'

badphilosopher
^

The condition failed so it tries to match bad to the left of the cursor, but we are at the beginning of the phrase, so no match.

It checks again at these two cursor points because the '\w+pher' matches each time:

badphilosopher
  ^

But the lookbehind sees 'b' then 'ba'

When the cursor gets to:

badphilosopher
   ^

The conditonal again fails to find a 'go' (it sees 'ph') so it attempts to match 'bad' to the left of the cursor and finds it! So there fore the \w+pher matches the philosopher text.

goodgopher
    ^

goodgopher matches in a similar way except that the conditional is successful.

badgopher
   ^

badgopher does not match because the conditonal is successful but 'good' is not found to the left of the cursor.

Putting a space in there really changes things up, because the /w+pher no longer matches the entire string.

    (?<=(?(?=go)good|bad)\s+)\w+pher

bad philosopher    <-- matches philosopher
good gopher        <-- no match
bad gopher         <-- matches gopher

In this case the cursor moves through the string until it can match \w+pher:

bad philosopher
    ^

At this point it starts to process the lookbehind -- and sees that a '\s+' is required to the left of the search cursor -- it finds this and moves the temporary lookbehind cursor over.

bad philosopher
   ^^

The conditional is run now and sees looking for 'go' at the temp lookbehind cursor but finds ' p'. The failure means trying to match bad to the left of the temp lookbehind cursor and indeed it finds it there.

good gopher
    ^^

The 'good gopher' example gets to the conditional and sees ' g' so it fails and then looks for 'bad' to the left of the cursor and doesn't find it. So this fails.

bad philosopher
   ^^

Similarly, 'bad philosopher' gets to the conditonal and finds ' p' and looks for 'bad' to the left of the cursor and finds it. So it matches.

When run without the lookbehind, all of these examples match. This can be rather counterintuitive - but you have to take the location of the cursors into account in the lookbehind.

OTHER TIPS

Your question is a great one to dive into lookarounds and lookaheads. It is a rich question, but to me it seems that the main question is this:

Why does (?<=(?(?=go)good|bad)\s)morning fail against "good morning"?

Because you say "My question is is this expected or some kind of bug?" This is the question I will address.

In a nutshell, at a certain position x in the string, you want to say:

  1. If what immediately follows is "go" i.e. (?=go) then immediately
    match "morning". Obviously this is incompatible. If you had
    "gorning", you would have a chance. What immediately follows cannot both be "mo" and "go". There is more syntax in that branch, but that is where it fails.
  2. Otherwise, (i.e. what follows is not "go"), then if what immediately precedes is "bad ", match "morning". This would match "morning" in "bad morning", but not "good morning".

Therefore the regex cannot match "good morning". QED.

Let's now take a look at a correct program with nearly the same expression. It successfully matches "goodgopher" and "badphilosopher". Once you understand why this works, you will understand why the other does not.

using System;
using System.Text.RegularExpressions;
class Program {
static void Main() {

// Simplest    
string s1 = "badphilosopher";
string s2 = "goodgopher";
string s3 = "badgopher";
string pattern = @"(?<=(?(?=go)good|bad))\w+pher";
Console.WriteLine(Regex.IsMatch(s1, pattern) );
Console.WriteLine(Regex.IsMatch(s2, pattern));
Console.WriteLine(Regex.IsMatch(s3, pattern));     
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program

The output:

True
True
False

The key is to understands that a lookaround asserts: "at this exact position in the string, I am followed (or preceded) by..."

When it is evaluated, the lookaround is firmly in place in one part of the string. If you string several lookarounds in a row, that position does not jump around.

So how does the above regex work?

At the current position in the string, we assert:

If at this position in the string, if we are preceded by x
   case 1: if at this position in the string what follows is "go"
           then x is good
   case 2: else
           then x is bad
Then match any number of word characters followed by philosopher

To get absolutely clear on lookahead and lookbehind, you may want to read this page about regex lookarounds.

Sub-Question in Comment:

(?<=(?(?=go)good|bad))\s+\w+pher matches on "bad philosopher" and "bad gopher". Why?

Here is why. In either "bad philosopher" or "bad gopher", place yourself just after the "d". This is your position x. At this position,

  1. You can match \s+\w+pher, as this is exactly what is in front of you
  2. In both cases, a position x, what immediately follows is not "go": right after the "d", in one case what follows is " g", in the other case it is " p" (note the space before the letter). Therefore, for both of those, you are not in the (?=go) situation but in the else, and you must assert "What immediately precedes position x is "bad". Is that true? Yes.
  3. Therefore the regex can find matches in both strings. QED. But the matches are not found where you expected, but right after the "d".

But... What Does the Engine Really Do?

So far, I have not been able to convince the OP about the path taken by the regex engine, so I thought I would paste a trace. Sadly, I do not have a trace tool for .NET regex, but I do have a trace tool for PCRE, an equally if not more potent regex engine.

The trace shows the path of the engine, as reported by pcretest, for the three test strings. Note one tiny difference with the regex in the .NET code supplied above: instead of good|bad we have goo|bad to accommodate for PCRE's lack of support for variable-length negative lookbehinds.

The trace clearly shows that PCRE evaluates the regex from left to right: first the lookbehind, then the \w+pher

I cannot say with 100% certainty that .NET proceeds in the same way, but this has certainly been the standard for regex engine. Of course there is a possibility that engines that support variable-length lookbehinds would proceed differently.

PCRE version 8.34 2013-12-15

~(?<=(?(?=go)goo|bad))\w+pher~C
badphilosopher
--->badphilosopher
 +0 ^                  (?<=(?(?=go)goo|bad))
 +0  ^                 (?<=(?(?=go)goo|bad))
 +0   ^                (?<=(?(?=go)goo|bad))
 +0    ^               (?<=(?(?=go)goo|bad))
 +4    ^                                      ^(?(?=go)goo|bad)
 +6    ^                                      ^(?=go)
 +9    ^                                      ^g
+16    ^                                      ^b
+17    ^                                      ^a
+18    ^                                      ^d
+19    ^               )
+20    ^               )
+21    ^               \w+
+24    ^          ^    p
+24    ^         ^     p
+24    ^        ^      p
+24    ^       ^       p
+24    ^      ^        p
+25    ^       ^       h
+26    ^        ^      e
+27    ^         ^     r
+28    ^          ^    
 0: philosopher
googopher
--->googopher
 +0 ^             (?<=(?(?=go)goo|bad))
 +0  ^            (?<=(?(?=go)goo|bad))
 +0   ^           (?<=(?(?=go)goo|bad))
 +0    ^          (?<=(?(?=go)goo|bad))
 +4    ^                                 ^(?(?=go)goo|bad)
 +6    ^                                 ^(?=go)
 +9    ^                                 ^g
+10    ^                                 ^o
+11    ^                                 ^)
+12    ^                                 ^g
+13    ^                                 ^o
+14    ^                                 ^o
+15    ^          |
+20    ^          )
+21    ^          \w+
+24    ^     ^    p
+24    ^    ^     p
+24    ^   ^      p
+24    ^  ^       p
+24    ^ ^        p
+25    ^  ^       h
+26    ^   ^      e
+27    ^    ^     r
+28    ^     ^    
 0: gopher
badgopher
--->badgopher
 +0 ^             (?<=(?(?=go)goo|bad))
 +0  ^            (?<=(?(?=go)goo|bad))
 +0   ^           (?<=(?(?=go)goo|bad))
 +0    ^          (?<=(?(?=go)goo|bad))
 +4    ^                                 ^(?(?=go)goo|bad)
 +6    ^                                 ^(?=go)
 +9    ^                                 ^g
+16    ^                                 ^b
+17    ^                                 ^a
+18    ^                                 ^d
+19    ^          )
+20    ^          )
+21    ^          \w+
+24    ^     ^    p
+24    ^    ^     p
+24    ^   ^      p
+24    ^  ^       p
+24    ^ ^        p
+25    ^  ^       h
+26    ^   ^      e
+27    ^    ^     r
+28    ^     ^    
 0: gopher

I've changed your example a little, so that it can be used in PCRE for comparison. PCRE requires a fixed width lookbehind. So both alternatives are 4 characters wide. I've changed the first alternative to match 4 characters go.. and the second alternative to match any 4 characters .....

These are the strings to be matched in the examples. I'm deliberately using underscores instead of spaces for them to show up visibly in capturing groups:

good_morning
mild_morning
cold_morning
gold_good_morning


No conditional

With almost everything in named capturing groups, it's easier to see what is matched by which part of the expression. First I'll not use a conditional, but an alternation in the lookbehind. The group name del is just short for "delimiter":

(?<= #  either                            or
     ( (?=(?<look>.*)) (?<either>go..) | (?<or>....) ) # $1
     (?<del>.)  # One width character
) 
morning         # The match

This matches "morning" on all 4 strings. The interesting bits are what's in the capturing groups. I used Regex Storm to get the captured groups:

Index Position Matched String $1 ${look} ${either} ${or} ${del}
0 5 morning good good_morning good empty string _
1 19 morning mild empty string empty string mild _
2 33 morning cold empty string empty string cold _
3 52 morning good good_morning good empty string _

Since the lookahead is associated with the first alternative, it's no surprise that it's empty when the second alternative is being matched. But when it matches, it matches the entire string. Almost. If ((?<del>.) # One width character is changed to (?<del>.*) # As mush as possible, the last row becomes:

Index Position Matched String $1 ${look} ${either} ${or} ${del}
3 52 morning gold gold_good_morning gold empty string _good_

It seems as though the lookahead cursor is offset from the main matching regex cursor.

According to Jan Goyvaerts's Lookahead and Lookbehind Zero-Length Assertions

the JGsoft engine and the .NET framework RegEx classes. These regex engines really apply the regex inside the lookbehind backwards, going through the regex inside the lookbehind and through the subject string from right to left.

and RexEgg.com's Mastering Lookahead and Lookbehind when going through the pattern (?<=(\d{1,5}))Z on the string "123Z".

.NET captures 123 to Group 1. The .NET engine has a far more efficient way of processing variable-width lookbehinds. Instead of trying multiple fixed-width patterns starting at points further and further back in the string, .NET reverses the string as well as the pattern inside the lookbehind, then attempts to match that single pattern on the reversed string. Therefore, in 123Z, to try the lookbehind at the point before Z, it reverses the portion of string to be tested from 123 to 321. Likewise, the lookbehind (?<=(\d{1,5})) is flipped into the lookahead (?=(\d{1,5})). \d{1,5} matches 321.

Finding the reversed regex

Reversing "good_morning" gives "gninrom_doog". For the regex to match "morning", the main regex cursor must be just between "_" and "m". I'm using a space to illustate between:

gninrom _doog
       ^

Reversing parts of the lookbehind:

(?=  # lookahead
     (?<del>.)  # One width character
     ...
) 

The alternation must not have been reversed, since the second alternative matches any 4 characters and if that had been changed to be the first alternative, the groups look and either would be empty on all matches.

     ( (?<either>..og) | (?<or>....) )

The look lookahead captures the entire string including "good", so it can only have become a lookbehind after the first alternative:

     ( (?<either>..og) (?<=(?<look>.*)) | (?<or>....) )

That gives:

(?=  # lookahead
     (?<del>.)  # One width character
     #  either                             or
     ( (?<either>..og) (?<=(?<look>.*)) | (?<or>....) )
) 

Going through this regex for the reversed string will natually yeild the result seen in the results. Going through it for "gold_good_morning":

    gninrom _doog_dlog
           ^
  • The del first takes a character, which is the "_"

    gninrom_ doog_dlog
            ^
    
  • The either group captures "doog"

    gninrom_doog _dlog
            ----^
    
  • The look captures the string from the current cursor all the way back:

    gninrom_doog _dlog
    ------------^
    
  • Once the lookahead (reversed lookbehind) is done, the string and the captures are reversed back.

Adding the Kleene star to del, the (?<del>.*) will greedily go to the end of the string. Then it will backtrack 4 characters and match (?<either>..og) on the string "dlog" before capturing everything into (?<=(?<look>.*)).

Adding the conditional

Adding the conditional instead of just the alternation, putting the lookahead into the condition:

(?<= #  if             then            else
     (?(?<=(?<if>.*)) (?<then>go..) | (?<else>....) ) 
     (?<del>.)  # One width character
) 
morning         # The match

Gives two matches. The condition if is always true and the then-part only mathes words starting with "go"

Index Position Matched String ${if} ${then} ${else} ${del}
0 19 morning _morning good empty string _
1 52 morning _morning good empty string _

Changing the (?<del>.) # One width character to (?<del>.*) # As mush as possible, the last row becomes:

Index Position Matched String ${if} ${then} ${else} ${del}
1 52 morning _good_morning gold empty string good

The interesting part is the capturing group if. It doesn't contains the entire string. That indicates that the conditional isn't reversed by being evaluated last, but instead it is still being evaluated at the beginning of that part of the reserved expression. Though the direction of the lookaround has been reversed:

(?= (?<del>.)  # One width character
    #  if             then            else
    (?(?<=(?<if>.*)) (?<then>..og) | (?<else>....) ) 
) 

Looking at how that will process "gold_good_morning", the capturing groups makes sense.

    gninrom _doog_dlog
           ^
  • The del first takes a character, which is the "_"

    gninrom_ doog_dlog
            ^
    
  • The if lookbehind captures everything from the cursor, which now includes the "_"

    gninrom_ doog_dlog
    --------^
    
  • Since the condition is true, the then part is executed capturing "doog"

    gninrom_doog _dlog
            ----^
    

Adding the Kleene star to del, the (?<del>.*) will again greedily go to the end of the string. But it's forced to backtrack 4 characters for either of then or else to make a match.


Comparison to PCRE

The Alternation

In order for PCRE to accept the alternation as fixed-width, I added a lookahead for the second alternative too:

(?<= #  either                                  or
     ( (?=(?<lookeither>.*)) (?<either>go..) | (?=(?<lookor>.*)) (?<or>....) ) # $1
     (?<del>.)  # One width character
) 
morning         # The match
Matched String $1 lookeither either lookor or del
morning good good_morning good null null _
morning mild null null mild_morning mild _
morning cold null null cold_morning cold _
morning good good_morning good null null _

At regex101

Instead of reversing the string, it counts back the 5 characters that is to be consumed inside the lookbehind and does its work from here.

Since PCRE doesn't support variable length lookbehinds, it's not possible to use the Kleene star on del, but using (?<del>.{6}) # Fixed six width character (to account for "_good_") gives one result. It can only give the one result as it has to backtrack a total of 10 characters. Only the string "gold_good_morning" makes that possible:

Matched String $1 lookeither either lookor or del
morning good gold_good_morning good null null good

The Conditional

(?<= #  if            then            else
     (?(?=(?<if>.*)) (?<then>go..) | (?<else>....) ) 
     (?<del>.) # One width character
) 
morning        # The match

gives this:

Matched String if then else del
morning good_morning good empty string _
morning good_morning good empty string _

At regex101

Since nothing is reversed, it's working as you expected it to work.

Using (?<del>.{6}) # Fixed six width character gives:

Matched String if then else del
morning gold_good_morning gold empty string _

I expect there's no surprises here.


Is it a bug or a feature?

If you want it to give you the same captures as PCRE, then I suppose you could call it a bug. But I'd argue that once you know how it works, you can tailor your expressions to make use of it, and then call it a feature. I'd argue that non-fixed width lookbehinds certainly is a feature.

Knowing that the .NET regex engine reverses the direction of the lookaround, you can simple just reverse the lookaround look, so that it's initally opposite. Sort of like backing in reverse is actually going forward :)

(?<= #  if             then            else
     (?(?<=(?<if>.*)) (?<then>go..) | (?<else>....) ) 
     (?<del>.) # One width character
) 
morning        # The match

will give you

Index Position Matched String ${if} ${then} ${else} ${del}
0 19 morning good good empty string _
1 52 morning gold_good good empty string _

Making your example work

Using your conditional for matching "morning" in "good morning", you can use

(?<=(?(?<=(go.*))good|bad)\s)morning

Obviously (?<=(go)) won't do, since that would be trying to match "og" against "do" in "doog". But .*og is going to match "doog".

You can try something like this:

(?<=(?:(?=\S*\d)[\p{P}\d]+|[A-Z0-9]+)\s)matching text

We can start with \s+matching\s+text. To eliminate match for and matching text and similar we can add negative lookbehind: (?<!^\s*[A-Z]{3,}).

To include match for others we can use positive lookbehind:

Match 2-letter patterns: [A-Z\d]{1,2}

Match longer patterns: [-A-Z\d/"']{3,} <= important, we eliminated strings composed only of letters with negative lookbehind.

Combining together 2 patterns we get the following: (?<=^\s*([A-Z\d]{1,2}|[-A-Z\d/"']{3,}))

Full regex would be:

(?<!^\s*[A-Z]{3,})(?<=^\s*([A-Z\d]{1,2}|[-A-Z\d/"']{3,}))\s+matching\s+text

I don't see any need for lookbehinds here. Just match the whole string normally and use a capturing group to extract the part that interests you:

Regex r = new Regex(
    @"(?imn)^([A-Z0-9]{1,2}\s+|(?![A-Z]+\s)[0-9A-Z/""'-]{3,}\s+)(?<captured>matching\s+text)$");
resultString = r.Match(s).Groups["captured"].Value;
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top