Matching n parentheses in perl regex

https://stackoverflow.com/questions/3066143

28-09-2019
|

Question

I've got some data that I'm parsing in Perl, and will be adding more and more differently formatted data in the near future. What I would like to do is write an easy-to-use function, that I could pass a string and a regex to, and it would return anything in parentheses. It would work something like this (pseudocode):

sub parse {
  $data = shift;
  $regex = shift;

  $data =~ eval ("m/$regex/")
  foreach $x ($1...$n)
  {
    push (@ra, $x); 
  }
  return \@ra;
}

Then, I could call it like this:

@subs = parse ($data, '^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)');

As you can see, there's a couple of issues with this code. I don't know if the eval would work, the 'foreach' definitely wouldn't work, and without knowing how many parentheses there are, I don't know how many times to loop.

This is too complicated for split, so if there's another function or possibility that I'm overlooking, let me know.

Thanks for your help!

Solution

In list context, a regular expression will return a list of all the parenthesized matches.

So all you have to do is:

my @matches = $string =~ /regex (with) (parens)/;

And assuming that it matched, @matches will be an array of the two capturing groups.

So using your regex:

my @subs = $data =~ /^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)/;

Also, when you have long regexes, Perl has the x modifier, which goes after the closing regex delimiter. The x modifier allows you to put white-space and newlines inside the regex for increased readability.

If you are worried about the capturing groups that might be zero length, you can pass the matches through @subs = grep {length} @subs to filter them out.

OTHER TIPS

Then, I could call it like this:

@subs = parse($data, 
          '^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)');

Instead, call it like:

parse($data, 
    qr/^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)/);

Further, your task would be made simpler if you can use named captures (i.e. Perl 5.10 and later). Here is an example:

#!/usr/bin/perl

use strict; use warnings;

my %re = (
    id => '(?<id> [0-9]+ )',
    name => '(?<name> \w+ )',
    value => '(?<value> [0-9]+ )',
);

my @this = (
    '123,one:12',
    '456,two:21',
);

my @that = (
    'one:[12],123',
    'two:[21],456',
);

my $this_re = qr/$re{id}   ,   $re{name}    : $re{value}/x;
my $that_re = qr/$re{name} : \[$re{value}\] , $re{id}   /x;

use YAML;

for my $d ( @this ) {
    print Dump [ parse($d, $this_re) ];
}

for my $d ( @that ) {
    print Dump [ parse($d, $that_re) ];
}

sub parse {
    my ($d, $re) = @_;
    return unless $d =~ $re;
    return my @result = @+{qw(id name value)};
}

Output:

---
- 123
- one
- 12
---
- 456
- two
- 21
---
- 123
- one
- 12
---
- 456
- two
- 21

You are trying to parse a complex expression with a regex - which is an insufficient tool for the job. Recall that regular expressions cannot parse higher grammars. For intuition, any expression which might be nested cannot be parsed with regex.

When you want to find text inside of pairs of parenthesis, you want to use Text::Balanced.

But, that is not what you want to do, so it will not help you.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow