Is there a better way to write Perl regexes with /x so the code is still easy to read?

https://stackoverflow.com/questions/987438

13-09-2019
|

Question

I ran Perl::Critic on one of my scripts, and got this message:

Regular expression without "/x" flag at line 21, column 26. See page 236 of PBP.

I looked up the policy information here, and I understand that writing regular expressions in extended mode will help anyone who is looking at the code.

However, I am stuck as how to convert my code to use the /x flag.

CPAN Example:

# Match a single-quoted string efficiently...

m{'[^\\']*(?:\\.[^\\']*)*'};  #Huh?

# Same thing with extended format...

m{
    '           # an opening single quote
    [^\\']      # any non-special chars (i.e. not backslash or single quote)
    (?:         # then all of...
        \\ .    #    any explicitly backslashed char
        [^\\']* #    followed by an non-special chars
    )*          # ...repeated zero or more times
    '           # a closing single quote
}x;

This makes sense if you only look at the regex.

My Code:

if ($line =~ /^\s*package\s+(\S+);/ ) {

I am not exactly sure how to use an extended regex inside of an if statement. I can write it like this:

    if (
        $line =~ /
        ^\s*    # starting with zero or more spaces
        package
        \s+     # at least one space
        (\S+)   # capture any non-space characters
        ;       # ending in a semi-colon
        /x
      )
    {

And this works, but I think this is almost harder to read than the original. Is there a better way (or a best practice way) to write this? I guess I could create a variable using qr//.

I'm not really looking for advice on re-writing this specific regex (although if I can improve it, I'll take advice) - I'm more looking for advice on how to expand a regex inside of an if statement.

I know Perl::Critic is just a guideline, but it would be nice to follow it.

Thanks in advance!

EDIT: So after receiving a few answers, it became clear to me that making a regex multi-line with comments is not always necessary. People who understand basic regex should be able to understand what my example was doing - the comments I added were maybe a little unnecessary and verbose. I like the idea of using the extended regex flag, but still embedding spaces in the regex to make each part of the regex a little more clear. Thanks for all the input!

Solution

Well, I really don't think you should waste vertical screen real estate on this. On the other hand, if I were to write this pattern over several lines, I would have used braces and indented the pattern:

if ($line =~ m{
        \A \s*
        package
        \s+
        (\S+)
        \s* ;
    }x 
) {

IMHO, the following version is perfectly fine:

if ( $line =~ m{ \A \s* package \s+ (\S+) \s* ; }x  ) {

in terms of getting the benefit of m//x.

The comments are completely unnecessary in this case because you are not doing anything tricky. I did add \s* before the semi-colon because sometimes people set the semi-colon apart from the package name and that should not throw off your match.

OTHER TIPS

Never write a comment that says what the code says. Comments should tell you why the code says what it says. Take a look at this monstrosity, without the comments it is very difficult to see what is going on, but the comments make it clear what is trying to be matched:

require 5.010;
my $sep         = qr{ [/.-] }x;               #allowed separators    
my $any_century = qr/ 1[6-9] | [2-9][0-9] /x; #match the century 
my $any_decade  = qr/ [0-9]{2} /x;            #match any decade or 2 digit year
my $any_year    = qr/ $any_century? $any_decade /x; #match a 2 or 4 digit year

#match the 1st through 28th for any month of any year
my $start_of_month = qr/
    (?:                         #match
        0?[1-9] |               #Jan - Sep or
        1[0-2]                  #Oct - Dec
    )
    ($sep)                      #the separator
    (?: 
        0?[1-9] |               # 1st -  9th or
        1[0-9]  |               #10th - 19th or
        2[0-8]                  #20th - 28th
    )
    \g{-1}                      #and the separator again
/x;

#match 28th - 31st for any month but Feb for any year
my $end_of_month = qr/
    (?:
        (?: 0?[13578] | 1[02] ) #match Jan, Mar, May, Jul, Aug, Oct, Dec
        ($sep)                  #the separator
        31                      #the 31st
        \g{-1}                  #and the separator again
        |                       #or
        (?: 0?[13-9] | 1[0-2] ) #match all months but Feb
        ($sep)                  #the separator
        (?:29|30)               #the 29th or the 30th
        \g{-1}                  #and the separator again
    )
/x;

#match any non-leap year date and the first part of Feb in leap years
my $non_leap_year = qr/ (?: $start_of_month | $end_of_month ) $any_year/x;

#match 29th of Feb in leap years
#BUG: 00 is treated as a non leap year
#even though 2000, 2400, etc are leap years
my $feb_in_leap = qr/
    0?2                         #match Feb
    ($sep)                      #the separtor
    29                          #the 29th
    \g{-1}                      #the separator again
    (?:
        $any_century?           #any century
        (?:                     #and decades divisible by 4 but not 100
            0[48]       | 
            [2468][048] |
            [13579][26]
        )
        |
        (?:                     #or match centuries that are divisible by 4
            16          | 
            [2468][048] |
            [3579][26]
        )
        00                      
    )
/x;

my $any_date  = qr/$non_leap_year|$feb_in_leap/;
my $only_date = qr/^$any_date$/;

It is pretty much your call as to the value added by such extra information.

Sometimes you're right, it doesn't add anything to explain what is going on and just makes the code look messy, but for complex regular expressions, the x flag can be a boon.

Actually, this "making a call" regarding the added value of additional information can be quite difficult.

I cannot remember how many times I've seen legacy code where beautifully formatted comments have not been maintained and so drift away from what the code is doing. In fact, when I was a lot less-experienced, I went completely up the wrong path because a comment associated with a piece of code was old and hadn't been maintained.

Edit: In some ways, the CPAN example is not really that useful. When using the x flag to add comments to describe a complex regexp, I tend to describe the components that the regexp is trying to match rather than just describe the regexp "bits" themselves. For example, I'd write things like:

the first component (area and district) of the UK postcode, or
the international area code for the UK, or
any UK mobile phone number.

which tells me more than

one or two letters, followed by a number, optionally followed by a letter, or
two four digits together, or
a zero, followed by four decimal digits, a dash and then six decimal digits.

My feeling would be to leave the regexp comments out in this case. Your gut feeling is right!

Seeing this topic is about alternative ways to write regular expressions, there are ways to write complicated regular expressions without variables, and without comments, and it still be useful.

I reflowed Chas Owens date validating regex to the new declarative form available in Perl-5.10, which has numerous benefits.

Tokens in the regex are reusable
Anyone printing the regex later will still see the whole logic tree.

It may not be everyones kettle of fish, but for extremely complex things such as date validating it can be handy ( ps: in the real world, please use a module for date stuff, don't DIY , this is just an example to learn from )

#!/usr/bin/perl 
use strict;
use warnings;
require 5.010;

#match the 1st through 28th for any month of any year
my $date_syntax = qr{
    (?(DEFINE)
        (?<century>
            ( 1[6-9] | [2-9][0-9] )
        )
        (?<decade>
            [0-9]{2} (?!\d)
        )
        (?<year>
            (?&century)? (?&decade)(?!\d)
        )
        (?<leapdecade> (
            0[48]       | 
            [2468][048] |
            [13579][26]
            )(?!\d)
        )
        (?<leapcentury> (
            16          | 
            [2468][048] |
            [3579][26]
            )
        )   
        (?<leapyear>
            (?&century)?(?&leapdecade)(?!\d)
            |
            (?&leapcentury)00(?!\d)
        )
        (?<monthnumber>      ( 0?[1-9] | 1[0-2] )(?!\d)                  )
        (?<shortmonthnumber> ( 0?[469] | 11     )(?!\d)                  )
        (?<longmonthnumber>  ( 0?[13578] | 1[02] )(?!\d)                 )
        (?<nonfebmonth>      ( 0?[13-9] | 1[0-2] )(?!\d)                 )
        (?<febmonth>         ( 0?2 )(?!\d)                               )
        (?<twentyeightdays>  ( 0?[1-9] | 1[0-9] | 2[0-8] )(?!\d)         )
        (?<twentyninedays>   ( (?&twentyeightdays) | 29 )(?!\d)          )
        (?<thirtydays>       ( (?&twentyeightdays) | 29 | 30 )(?!\d)     )
        (?<thirtyonedays>    ( (?&twentyeightdays) | 29 | 30 | 31 )(?!\d))
        (?<separator>        [/.-]                              )               #/ markdown syntax highlighter fix
        (?<ymd>
            (?&leapyear) (?&separator) (?&febmonth) (?&separator) (?&twentyninedays) (?!\d)
            |
            (?&year) (?&separator) (?&longmonthnumber) (?&separator) (?&thirtyonedays) (?!\d)
            |
            (?&year) (?&separator) (?&shortmonthnumber) (?&separator) (?&thirtydays) (?!\d)
            |
            (?&year) (?&separator) (?&febmonth) (?&separator) (?&twentyeightdays) (?!\d)
        )
        (?<mdy>
            (?&febmonth) (?&separator) (?&twentyninedays) (?&separator) (?&leapyear)  (?!\d)
            |
            (?&longmonthnumber) (?&separator) (?&thirtyonedays) (?&separator) (?&year) (?!\d)
            |
            (?&shortmonthnumber) (?&separator) (?&thirtydays) (?&separator) (?&year) (?!\d)
            |
            (?&febmonth) (?&separator) (?&twentyeightdays) (?&separator) (?&year) (?!\d)
        )
        (?<dmy>
            (?&twentyninedays) (?&separator) (?&febmonth) (?&separator) (?&leapyear)  (?!\d)
            |
            (?&thirtyonedays) (?&separator) (?&longmonthnumber) (?&separator)(?&year) (?!\d)
            |
            (?&thirtydays) (?&separator) (?&shortmonthnumber) (?&separator) (?&year) (?!\d)
            |
            (?&twentyeightdays) (?&separator) (?&febmonth) (?&separator)  (?&year) (?!\d)
        )
        (?<date>
            (?&ymd) | (?&mdy) | (?&dmy)
        )
        (?<exact_date>
           ^(?&date)$
       )
    )
}x;

my @test = ( "2009-02-29", "2009-02-28", "2004-02-28", "2004-02-29", "2005-03-31", "2005-04-31", "2005-05-31", 
    "28-02-2009","02-28-2009",        
);

for (@test) {
  if ( $_ =~ m/(?&exact_date) $date_syntax/x ) {
    print "$_ is valid\n";
  }
  else {
    print "$_ is not valid\n";
  }

  if ( $_ =~ m/^(?&ymd) $date_syntax/x ) {
    print "$_ is valid ymd\n";
  }
  else {
    print "$_ is not valid ymd\n";
  }


  if ( $_ =~ m/^(?&leapyear) $date_syntax/x ) {
    print "$_ is leap (start)\n";
  }
  else {
    print "$_ is not leap (start)\n";
  }

  print "\n";
}

Note the addition of the (?!\d) snippets, which are added so that

"45" wont match ~= m{(?&twentyeightdays) $syntax} due to '4' matching 0?[4]

Seems like this is more a question of how to consistently indent a multiline if condition...to which there are a great many answers. What really matters is consistency. If you use perltidy or some other formatter, be consistent with what it comes up with (with your configuration). I would indent the contents of the regex one level from the delimiters, though.

Your post shows one major flaw in running existing code through something like Perl::Critic - ~~you~~ the CPAN example left out a * from the original regex. If you do a lot of "cleanup", you can expect to introduce bugs, so I hope for your sake that you have a good test suite.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow