Question

I have a question I am hoping someone could help with...

I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).

The variable contains data such as these:

$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"

The only bits I am interested in from the above examples are:

@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger") 
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")

The problem I am having:

I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.

But what is the best way to make sure that I get the strings at the start (i.e. cat_dog) and end (i.e. chicken-pig) of the comma-separated list of animals, given that they are not prefixed/suffixed with a comma?

Also, as the variables will contain webpage content, there will inevitably be instances where a comma is immediately followed by a space and then another word, as that is the normal way of using commas in paragraphs and sentences...

For example:

Saturn was long thought to be the only ringed planet, however, this is now known not to be the case. 
                                                     ^        ^
                                                     |        |
                                    note the spaces here and here

I am not interested in any cases where the comma is followed by a space (as shown above).

I am only interested in cases where the comma DOES NOT have a space after it (i.e. cat_dog,horse,rabbit,chicken-pig).

I have tried a number of ways of doing this but cannot work out the best way to construct the regular expression.

Solution

How about

[^,\s]+(,[^,\s]+)+

which matches one or more characters that are neither whitespace nor a comma ([^,\s]+), followed one or more times by a comma and another run of such characters.

Further to comments

To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to @matches.

my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;

while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}   

print join("\n",@matches),"\n";
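
As a minimal sketch, the same idea can be wrapped in a helper and run against the sample strings from the question. The name extract_lists is hypothetical, and the sketch captures each run in a group ($1) rather than using $&; otherwise the pattern is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every item of every comma-separated run
# (commas with no whitespace around them) found in the given string.
sub extract_lists {
    my ($text) = @_;
    my @items;
    # The outer group captures one whole run; /g moves on to the next run.
    while ($text =~ /([^,\s]+(?:,[^,\s]+)+)/g) {
        push @items, split /,/, $1;
    }
    return @items;
}

my @samples = (
    "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
    "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
    "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
    "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
);

for my $sample (@samples) {
    my @array = extract_lists($sample);
    print scalar(@array) ? join("|", @array) : "(no list found)", "\n";
}
```

The Saturn sentence should produce no items, because each of its commas is followed by a space and so never satisfies ,[^,\s]+.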

OTHER TIPS

Though you can probably construct a single regex, a combination of regexes, split, grep and map looks decent:

my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;

Going from right to left:

  1. Split the line on spaces (split)
  2. Keep only elements that have no comma at either end but have one inside (grep)
  3. Split each such element into parts (map and split)

That way you can easily change the individual parts; e.g. to reject elements containing two consecutive commas, add && !/,,/ inside the grep.
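
A minimal sketch of that chain as a reusable function follows. The name items_from_line is hypothetical, and the !/,,/ check mentioned above is included. Note that the bare split in the one-liner operates on $_, so this sketch passes the line explicitly instead.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the split/grep/map chain (read bottom to top).
sub items_from_line {
    my ($line) = @_;
    return map  { split /,/ }                        # 3. break each kept token on commas
           grep { !/^,/ && !/,$/ && !/,,/ && /,/ }   # 2. keep tokens with interior commas only
           split ' ', $line;                         # 1. whitespace-separated tokens
}

my @array = items_from_line(
    "Another sentence, although having commas, should not confuse this: a,b,c,d"
);
print join("\n", @array), "\n";
```

Words such as "sentence," are dropped in step 2 because they end with a comma, so only the tokens of a,b,c,d should survive.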

I hope this is clear and suits your needs:

#!/usr/bin/perl
use warnings;
use strict;

my @strs = (
    "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
    "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
    "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
    "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
    "Another sentence, although having commas, should not confuse the regex with this: a,b,c,d",
);

my $regex = qr/
    \s              # From your examples, it seems as if every
                    # comma-separated list is preceded by a space.
    (
        (?:
            [^,\s]+ # Not a comma or whitespace: one term of the list,
            ,       # followed by a comma
        )+
        [^,\s]+     # followed by one last term of the list
    )
/x;

# One array ref per input string: the first list found, or an empty
# array ref when the string contains no list. Testing the match result
# itself (rather than the truth of $1) avoids misreading a list such as "0".
my @matches = map { /$regex/ ? [ split /,/, $1 ] : [] } @strs;

Another approach, using lookarounds:

$var =~ tr/ //s;    # squeeze runs of spaces into a single space
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
    push(@arr, $&);
}

The regular expression matches three cases:

(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,)      : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig
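
For experimenting with this pattern, here is a minimal sketch that wraps the snippet in a function. The name lookaround_items is hypothetical; the regex is copied unchanged. One assumption worth flagging: the \b anchors expect each item to begin and end with a word character, which holds for the sample data but may not hold for arbitrary page content.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the lookaround regex shown above.
sub lookaround_items {
    my ($var) = @_;
    my @arr;
    $var =~ tr/ //s;    # squeeze runs of spaces into a single space
    while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
        push @arr, $&;  # each match is a single item, so no split is needed
    }
    return @arr;
}

my @arr = lookaround_items("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig");
print join("\n", @arr), "\n";
```

Unlike the other answers, this variant yields one item per match rather than one whole list per match.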
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow