Question

I have a string of arbitrary length, and starting at position p0, I need to find the first occurrence of one of three 3-letter patterns.

Assume the string contain only letters. I need to find the count of triplets starting at position p0 and jumping forward in triplets until the first occurrence of either 'aaa' or 'bbb' or 'ccc'.

Is this even possible using just a regex?

Was it helpful?

Solution

Moritz says this might be faster than a regex. Even if it's a little slower, it's easier to understand at 5 am. :)

             #0123456789.123456789.123456789.  
my $string = "alsdhfaaasccclaaaagalkfgblkgbklfs";  
my $pos    = 9;  
my $length = 3;  
my $regex  = qr/^(aaa|bbb|ccc)/;

while( $pos < length $string )    
    {  
    print "Checking $pos\n";  

    if( substr( $string, $pos, $length ) =~ /$regex/ )
        {
        print "Found $1 at $pos\n";
        last;
        }

    $pos += $length;
    }

OTHER TIPS

$string=~/^   # from the start of the string
            (?:.{$p0}) # skip (don't capture) "$p0" occurrences of any character
            (?:...)*?  # skip 3 characters at a time,
                       # as few times as possible (non-greedy)
            (aaa|bbb|ccc) # capture aaa or bbb or ccc as $1
         /x;

(Assuming p0 is 0-based).

Of course, it's probably more efficient to use substr on the string to skip forward:

substr($string, $p0)=~/^(?:...)*?(aaa|bbb|ccc)/;

You can't really count with regexes, but you can do something like this:

pos $string = $start_from;
$string =~ m/\G         # anchor to previous pos()
            ((?:...)*?) # capture everything up to the match
            (aaa|bbb|ccc)
            /xs  or die "No match"
my $result = length($1) / 3;

But I think it's a bit faster to use substr() and unpack() to split into triple and walk the triples in a for-loop.

(edit: it's length(), not lenght() ;-)

The main part of this is split /(...)/. But at the end of this, you'll have your positions and occurrence data.

my @expected_triplets = qw<aaa bbb ccc>;
my $data_string      
    = 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
    ;
my $place          = 0;
my @triplets       = grep { length } split /(...)/, $data_string;
my %occurrence_for = map { $_, [] } @expected_triplets;
foreach my $i ( 0..@triplets ) {
    my $triplet = $triplets[$i];
    push( @{$occurrence_for{$triplet}}, $i ) if exists $occurrence_for{$triplet};
}

Or for simple counting by regex (it uses Experimental (??{}))

my ( $count, %count );
my $data_string      
    = 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
    ;
$data_string =~ m/(aaa|bbb|ccc)(??{ $count++; $count{$^N}++ })/g;

If speed is a serious concern, you can, depending on what the 3 strings are, get really fancy by creating a tree (e.g. Aho-Corasick algorithm or similar).

A map for every possible state is possible, e.g. state[0]['a'] = 0 if no strings begin with 'a'.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top