Question

When a capture group is followed by a question mark, the backreference appears to be unavailable

my $test = "this is a very long day indeed";

if ($test =~ m/^this.+(very).+(indeed)?/) {
  print "It matched the regex.\n";
  print "$1 :: $2\n";
}

This prints

It matched the regex.
very :: 

Is this normal behaviour? I can't find mention of it in any documentation. I'm trying to match lines in a log file where the second capture group may or may not exist.


Solution

It's not a backreference problem. The characters you expected in the last group are consumed by the preceding .+ rather than by the optional capturing group, so that group does not participate in the match and $2 is left undefined (which prints as an empty string).

The cause is the greedy quantifier: .+ first takes every character up to the end of the line. Because your last group is optional, the pattern already succeeds at that point, so the regex engine has no reason to backtrack and look for "indeed".
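To see this for yourself, here is a small sketch that reuses the pattern from the question but also captures the second .+ (the extra group and the @groups array are my additions, purely for illustration):

```perl
use strict;
use warnings;

my $test = "this is a very long day indeed";

# Same pattern as in the question, with the second .+ captured as well
# so that what it consumed becomes visible. In list context the match
# returns the captures, with undef for a group that did not participate.
my @groups = $test =~ m/^this.+(very)(.+)(indeed)?/;

print "very   = '$groups[0]'\n";
print "greedy = '$groups[1]'\n";
print "indeed = ", (defined $groups[2] ? "'$groups[2]'" : "undef"), "\n";
```

The greedy group comes back as ' long day indeed' and the optional group as undef, which is why "$1 :: $2" printed nothing after the colons.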

A simple way to solve the problem is to use a lazy quantifier instead, together with an end anchor that forces the match to extend to the end of the line (without the anchor, a lazy quantifier stops as soon as possible):

m/^this.+(very).+?(indeed)?$/

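note: this relies on "indeed" being the last characters of the string. If other text can follow it, adding .* before the $ is not enough, because the lazy .+? can then stop after a single character and leave the group empty; make the whole tail optional instead, for example m/^this.+(very)(?:.+?(indeed))?/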
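For the log-file case, a minimal sketch of how the lazy, anchored pattern behaves on a line with and without the trailing word could look like this (parse_line and the two sample lines are hypothetical, not taken from your log):

```perl
use strict;
use warnings;

# Hypothetical helper: in list context it returns ($1, $2) for one line,
# or an empty list if the line does not match at all.
sub parse_line {
    my ($line) = @_;
    return $line =~ m/^this.+(very).+?(indeed)?$/;
}

for my $line ("this is a very long day indeed", "this is a very long day") {
    my @m = parse_line($line);
    next unless @m;

    # The optional group is undef when it did not take part in the match,
    # so test it with defined() before printing.
    my $second = defined $m[1] ? $m[1] : '(none)';
    print "$m[0] :: $second\n";
}
```

This should print "very :: indeed" for the first line and "very :: (none)" for the second. Checking defined() also keeps "use warnings" quiet about an uninitialized $2.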

OTHER TIPS

This is an additional note about greediness, which was your problem (already answered by Casimir).

Keep in mind that quantifiers are greedy by default: they consume as much as they can, and give characters back only as far as needed for the sub-expression to their right to match.

Any time you are about to apply a greedy quantifier to the DOT metachar, as in .+, it should raise a red flag and make you think twice. If it can, it will blow right past what you intend to match.

For this reason, try to replace it with something more specific that doesn't have a chance to go past your intended target.

Modifying your sample regex slightly shows how this could happen.

 my $test = "this is a very long day indeed, very long.";

 if ($test =~ m/

      ^
      ( this )               # (1)
      ( .+ )                 # (2)
      ( very )               # (3)
      ( .+ )                 # (4)
      ( indeed )?            # (5)

 /x) {
   print "All  = '$&'\n";
   print "grp1 = '$1'\n";
   print "grp2 = '$2'\n";
   print "grp3 = '$3'\n";
   print "grp4 = '$4'\n";
 }

 # Output >>
 # 
 # All  = 'this is a very long day indeed, very long.'
 # grp1 = 'this'
 # grp2 = ' is a very long day indeed, '
 # grp3 = 'very'
 # grp4 = ' long.'
 # 
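
One possible way to make the dots "more specific" is to guard each of them with a negative lookahead, so the run cannot step onto the word you are looking for. This is only a sketch of that idea, not the only option (a lazy quantifier or a narrower character class may suit your log format better); the @g array is just for illustration:

```perl
use strict;
use warnings;

my $test = "this is a very long day indeed, very long.";

my @g = $test =~ m/
    ^
    ( this )                     # (1)
    ( (?: (?! very ) . )+ )      # (2) any run that cannot step onto "very"
    ( very )                     # (3)
    ( (?: (?! indeed ) . )* )    # (4) any run that cannot step onto "indeed"
    ( indeed )?                  # (5)
/x;

# Print every group, showing undef for a group that did not participate.
for my $i (0 .. $#g) {
    printf "grp%d = %s\n", $i + 1, defined $g[$i] ? "'$g[$i]'" : 'undef';
}
```

With the same test string, group 2 now stops at the first "very" (' is a '), group 4 is ' long day ', and group 5 captures 'indeed'. If "indeed" is absent, group 4 runs to the end of the line and group 5 is undef.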
Licensed under: CC-BY-SA with attribution