Perl all matches of a regexp in a given string

https://stackoverflow.com/questions/19010860

29-06-2022
|

문제

Perl's regexp matching is left-greedy, so that the regexp

/\A (a+) (.+) \z/x

matching the string 'aaab', will set $1='aaa' and $2='b'. (The \A and \z are just to force start and end of the string.)

You can also give non-greedy qualifiers, as

/\A (a+?) (.+?) \z/x

This will still match, but give $1='a' and $2='aab'.

But I would like to check all possible ways to generate the string, which are

$1='aaa' $2='b'
$1='aa'  $2='ab'
$1='a'   $2='aab'

The first way corresponds to the default left-greedy behaviour, and the third way corresponds to making the first match non-greedy, but there may be ways in between those extremes. Is there a regexp engine (whether Perl's, or some other such as PCRE or RE2) which can be made to try all possible ways that the regexp specified generates the given string?

Among other things, this would let you implement 'POSIX-compatible' regexp matching where the longest total match is picked. In my case I really would like to see every possibility.

(One way would be to munge the regexp itself, replacing the + modifier with {1,1} on the first attempt, then {1,2}, {1,3} and so on - for each combination of + and * modifiers in the regexp. That is very laborious and slow, and it's not obvious when to stop. I hope for something smarter.)

Background

To answer Jim G.'s question on what problem this might solve, consider a rule-based translation system between two languages, given by the rules

translate(any string of one or more 'a' . y) = 'M' . translate(y)
translate('ab') = 'U'

Then there is a possible result of translate('aaab'), namely 'MU'. You might try to put these rules into Perl code based on regexps, as

our @m;
my @rules = (
  [ qr/\A (a+) (.*) \z/x => sub { 'M' . translate($m[1]) } ],
  [ qr/\A ab        \z/x => sub { 'U'                    } ],
);

where translate runs over each of @rules and tries to apply them in turn:

sub translate {
    my $in = shift;
    foreach (@rules) {
        my ($lhs, $rhs) = @$_;
        $in =~ $lhs or next;
        local @m = ($1, $2);
        my $r = &$rhs;
        next if index($r, 'fail') != -1;
        return $r;
    }
    return 'fail';
}

However, calling translate('aaab') returns 'fail'. This is because it tries to apply the first rule matching (a+)(.*) and the regexp engine finds the match with the longest possible string of 'a'.

Using the answer suggested by ikegami, we can try all ways in which the regular expression generates the string:

use re 'eval';
sub translate {
    my $in = shift;
    foreach (@rules) {
        my ($lhs, $rhs) = @$_;
        local our @matches;
        $in =~ /$lhs (?{ push @matches, [ $1, $2 ] }) (*FAIL)/x;
        foreach (@matches) {
            local @m = @$_;
            my $r = &$rhs;
            next if index($r, 'fail') != -1;
            return $r;
        }
    }
    return 'fail';
}

Now translate('aaab') returns 'MU'.

해결책

local our @matches;
'aaab' =~ /^ (a+) (.+) \z (?{ push @matches, [ $1, $2 ] }) (*FAIL)/x;

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow