Using Perl to iterate through a string 3 positions at a time
-
28-05-2021 - |
Question
I have written the following code in Perl. I want to iterate through a string 3 positions (characters) at a time. If TAA
, TAG
, or TGA
(stop codons) appear, I want to print till the stop codons and remove the rest of the characters.
Example:
data.txt
ATGGGTAATCCCTAGAAATTT
ATGCCATTCAAGTAACCCTTT
Answer:
ATGGGTAATCCCTAG (last 6 characters removed)
ATGCCATTCAAGTAA (last 6 characters removed)
(Each sequence begins with ATG).
Code:
#!/usr/bin/perl -w
open FH, "data.txt";
@a=<FH>;
foreach $tmp(@a)
{
for (my $i=0; $i<(length($tmp)-2); $i+=3)
{
if ($tmp=~/(ATG)(\w+)(TAA|TAG|TGA)\w+/)
{
print "$1$2$3\n";
}
else
{
print "$tmp\n";
}
$tmp++;
}
}
exit;
However, my code is not giving the correct result. There should not be any overlaps in the characters (I want to move every 3 characters).
Can someone suggest how to fix the error?
Thanks!
Solution
Script:
#!/usr/bin/perl
use strict;
use warnings;
open FH, "data.txt";
my @a = <FH>;
foreach (@a) {
print /^(ATG(...)*?(TAA|TAG|TGA))/? $1 : $_, "\n";
}
Output:
ATGGGTAATCCCTAG
ATGCCATTCAAGTAA
OTHER TIPS
I think this code will do. It uses \w{3}
- three-symbol codons as you need.
#!/usr/bin/perl -w
open FH, "data.txt";
@a=<FH>;
foreach $tmp(@a) {
if ($tmp=~ /^(ATG(?:\w{3})*(?:TAA|TAG|TGA)).*/) {
print "$1\n";
} else {
print "$tmp\n";
}
}
You say you want to remove everything after the first stop codon. If so, all you need is
while (<FH>) {
s/(?<=TAA|TAG|TGA).*//;
print;
}
But then there's the mystical "I want to iterate through a string 3 positions (characters) at a time" requirement. That doesn't make any sense. Perhaps you want the match to occur at a position that's divisible by three? If so, you'd use
s/^(?:.{3})*?(?:TAA|TAG|TGA)\K.*//; # Requires 5.10+
s/^((?:.{3})*?(?:TAA|TAG|TGA)).*/$1/; # Backwards compatible
May I suggest a reading of perlretut (about 4 paragraphs down from here)? It actually covers almost exactly this situation with avoiding overlaps and finding stop codons.