Question

I have text in a file, to be read by Perl, that takes one of two forms. Either:

previous content ending with a linebreak
keyword: content
next content

or

previous content, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines, closed by matching parenthesis}
next content

In either case, I have successfully loaded the contents, from the beginning of previous content to the end of next content, into a string; call it $str.

Now, I want to extract the stuff between the linebreak that ends previous content, and the linebreak before next content.

So I used a regex on $str like this:

if($str =~
        /.*\nkeyword: # keyword: is always constant, immediately after a newline
        (?!\{+)       # NO { follows
        \s+(?!\{+)    # NO { with a heading whitespace
        \s*           # white space between keyword: and content
        (?!\{+)       # no { immediately before content 
                      # question : should the last one be a negative lookbehind AFTER the check for content itself?
        ([^\s]+)      # the content, should be in $1;
        (?!\{+)       # no trailing { immediately after content
        \s+           # delimited by a whitespace, ignore what comes afterwards
        |             # or
        /.*\nkeyword: # keyword: is always constant, immediately after a newline
        (?=\s*{*\s*)*) # any mix of whitespace and {
        (?=\{+)       # at least one {
        (?=\s*{*\s*)*) # again any mix of whitespace and {
        ([^\{\}]+)    # no { or }
        (?=\s*}*\s*)*) # any mix of whitespace and }
        (?=\}+)       # at least one }
        (?=\s*}*\s*)*) # again any mix of whitespace and }
) { #do something with $1}

I realize that this does not really handle multiline information with nested parentheses; however, it should capture objects of the form keyword: {{ content} }

However, while I am able to capture the content in $1 in case of

keyword: content 

form, I am unable to capture

keyword: {multiline with nested
{parenthesis} } 

In the end I implemented it with a simple counter-based parser instead of a regex. I would love to know how I can do this with a regex, to capture objects of the second form, with an explanation of the regex, please.

Also, where did my formulation go wrong, such that it does not even capture single-line content with multiple (but matched) leading and trailing parentheses?

Solution

You can use this:

#!/usr/bin/perl
use strict;
use warnings;

my $str = "previous content ending with a linebreak
keyword: content
next content

previous content, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines, closed by matching parenthesis}
next content";

while ($str =~ /\nkeyword:  
            (?| # branch reset: i.e. the two capture groups have the same number
                \s*
                ({ (?> [^{}]++ | (?1) )*+ }) # recursive pattern
              |               # OR
                \h*
                (.*+)   # capture all until the end of line
            )   # close the branch reset group
             /xg ) {

    print "$1\n";
}

This pattern first tries to match content with nested curly brackets. If curly brackets are not found or are not balanced, the second alternative is tried, and it matches only the rest of the line (since, without the s modifier, the dot can't match newlines).

The branch reset feature (?|..|..) is useful to give the same number to the capturing group of each part of the alternation.
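
To see both alternatives at work, here is a minimal sketch that wraps the same pattern in a sub and collects every match. The sub name keyword_values and the sample strings are illustrative assumptions, and the braces are escaped only for readability. The last call shows the fallback: the braces are unbalanced, so only the rest of the line is captured.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every value that follows "\nkeyword:" in $text.
sub keyword_values {
    my ($text) = @_;
    my @values;
    while ($text =~ /\nkeyword:
                (?|     # branch reset: both alternatives capture into group 1
                    \s*
                    ( \{ (?> [^{}]++ | (?1) )*+ \} )   # balanced braces
                  |
                    \h*
                    (.*+)                              # rest of the line
                )
             /xg) {
        push @values, $1;
    }
    return @values;
}

my @balanced = keyword_values("prev\nkeyword: content\nnext\nprev\nkeyword: { a {\nb } c\nd}\nnext");
print scalar(@balanced), " values found\n";

# Unbalanced braces: the first alternative fails, the second takes over.
my @fallback = keyword_values("prev\nkeyword: { open { only\nnext");
print "$fallback[0]\n";
```

Because of the branch reset, the caller reads $1 in both cases and never needs to check which alternative matched.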

Recursive pattern details:

(                 # open the capturing group 1
    {             # literal opening curly bracket
    (?>           # atomic group: possible content between brackets
        [^{}]++   # all that is not a curly bracket
      |           # OR
        (?1)      # recurse to the capturing group 1 (!here is the recursion!)
    )*+           # repeat the atomic group zero or more times
    }             # literal closing curly bracket
)                 # close the capturing group 1

In this subpattern I use an atomic group (?>...) and the possessive quantifiers ++ and *+ to avoid backtracking as much as possible.
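
The recursive subpattern can also be tried on its own. The following is a minimal sketch, assuming you only want to know whether a whole string is one balanced block; the sub name is_balanced_block is hypothetical, and the anchors \A and \z are additions that are not part of the pattern above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $text is exactly one balanced {...} block.
sub is_balanced_block {
    my ($text) = @_;
    return $text =~ /\A
                     (                 # capturing group 1
                         \{            # literal opening curly bracket
                         (?>           # atomic group
                             [^{}]++   # anything that is not a curly bracket
                           |
                             (?1)      # recurse into group 1
                         )*+
                         \}            # literal closing curly bracket
                     )
                     \z/x ? 1 : 0;
}

for my $candidate ('{a{b}c}', '{}', '{a{b}', '{a}{b}') {
    printf "%-8s %s\n", $candidate,
        is_balanced_block($candidate) ? 'balanced' : 'not a single balanced block';
}
```

The third string fails because the inner bracket is closed but the outer one is not, and the possessive quantifiers make the engine give up right away instead of trying other ways to split the text.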

Other tips

How about something like this?

if ($str =~ /keyword:\s*{(.*)}/s) {
    my $key = $1;
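    # Look for an innermost pair of braces inside the captured text.
    # (A bare /([^{}]*)/ would always succeed, possibly with an empty match.)
    if ($key =~ /\{([^{}]*)\}/) {
        print "$1\n";
    }
    else {
        print "$key\n";
    }
}
elsif ($str =~ /keyword:\s*(.*)/) {
    print "$1\n";
}

[^{}]* between a literal { and } looks for a chunk of characters that doesn't have any braces in it, i.e. the innermost text of the nested braces.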

The s modifier allows you to look at multiple lines even when using .*. However, you don't want to look at multiple lines for keywords without braces, so that part is in the elsif statement.
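
As a small illustration of that difference, here is a sketch that runs the same capture with and without the s modifier. The sub name capture_after_keyword and the sample string are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: captures everything after "keyword:", with or
# without the s modifier, so the two behaviours can be compared.
sub capture_after_keyword {
    my ($text, $dot_matches_newline) = @_;
    my ($captured) = $dot_matches_newline
        ? $text =~ /keyword:\s*(.*)/s
        : $text =~ /keyword:\s*(.*)/;
    return $captured;
}

my $sample = "keyword: {first line\nsecond line}\nnext content";

print capture_after_keyword($sample, 0), "\n";   # stops at the first newline
print "---\n";
print capture_after_keyword($sample, 1), "\n";   # runs to the end of the string
```

Note that with the s modifier the greedy .* keeps going past the closing brace, which is why the braced case needs the literal } after it to stop the match.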

Do you need to have the same number of matching braces? For example, should keyword: {foo{bar{hello}}} output {{{hello}}}? If so, I feel like it would be better to stick with counters.
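
For reference, a counter-based version could look roughly like this. It is only a sketch under a few assumptions: the sub name extract_braced is hypothetical, it looks at the first keyword: in the string only, and it does not treat braces inside quotes or escapes specially.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the balanced {...} block that follows
# "keyword:" in $text, or nothing if there is no block or it never closes.
sub extract_braced {
    my ($text) = @_;

    # Locate "keyword:" followed by optional whitespace and an opening brace.
    return unless $text =~ /keyword:\s*(?=\{)/;
    my $start = $+[0];    # offset of the opening brace

    my $depth = 0;
    for my $i ($start .. length($text) - 1) {
        my $ch = substr($text, $i, 1);
        if    ($ch eq '{') { $depth++ }
        elsif ($ch eq '}') { $depth-- }

        # Depth returns to zero exactly at the matching closing brace.
        return substr($text, $start, $i - $start + 1) if $depth == 0;
    }
    return;    # ran out of text before the braces balanced
}

print extract_braced("keyword: {foo{bar{hello}}} next content"), "\n";
```

The counter goes up on every { and down on every }, so the first position where it is back at zero is the brace that closes the block.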

Edit:

For the input

keyword: {multiline 
with nested {parenthesis} }

if you want the output

{multiline with nested {parenthesis} }

I believe that would be

if ($str =~ /keyword:\s*({.*})/s) {
    my $match = $1;
    $match =~ s/\n//g;
    print "$match\n";
}
elsif ($str =~ /keyword:\s*(.*)/) {
    print "$1\n";
}
Licensed under: CC-BY-SA with attribution