Question

I have a fasta file formatted like the below

>gi|84341511|gb|DU991381.1| KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence
ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGT
AAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGG
AGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTAC
GAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAA
CTTCAAAGTAAACAATGTGTTGTGTC
>gi|84341510|gb|DU991380.1| KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence
GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTA
TCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCC
CAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTT
ATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF

From the header I have separated ID as 4th position(DU991381.1) separated by a pipeline (|) and info (E.g. KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence) by spliting the header.

However I cannot make the sequences in a line and push them to an array.

This is my code

#!/usr/bin/perl

use warnings;
use strict;

my $stData = "C:\\data.txt";

open (DATA, $stData);

my @stID   = "";
my @stInfo = "";
my @stSeq  = "";

while (my $stLine = <DATA>)
{
    chomp($stLine);
    if ($stLine =~ /^>+/)
    {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);
    }
    else #This is the part I cannot figure out
    {
        my $stSeq = $stLine.$stLine;
        print $stSeq,"\n";
    }
}


close (DATA);

So the outcome that I want is like this, an array with sequences in a line.

@stSeq  = [ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGTAAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGGAGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTACGAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAACTTCAAAGTAAACAATGTGTTGTGTC, GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTATCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCCCAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTTATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF]

Please, help me out! Cheers.

Was it helpful?

Solution

...

my @stID   = (); # empty arrays
my @stInfo = ();
my @stSeq  = ();
my $tmpStSeq = ""; # temporary empty string


while (my $stLine = <DATA>) {
    chomp($stLine);
    if ($stLine =~ /^>+/) {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);

        if ($tmpStSeq ne "") { #previous seq is pushed when a next ">" is found
          push @stSeq, $tmpStSeq;
          $tmpStSeq = ""; # and previous accumulations are flushed
        }
        next; # adding this you avoid `else block`
    }
    # else {
    $tmpStSeq .= $stLine; # accumulates into temporary (of course, no `my` needed)
    # }        
}

if ($tmpStSeq ne "") { # pushes last seq into stSeq
  push @stSeq, $tmpStSeq;
  $tmpStSeq = "";
}

# here you can use your arrays.

...

Maybe using an array of hashes would be better.

OTHER TIPS

I'd do:

my %data;
my ($id, $info);
while (my $stLine = <DATA>) {
    chomp($stLine);
    if ($stLine =~ /^>+/) {
        ($id, $info) = (split/\|/,$stLine)[3,4];
    }
    else {
        $data{$id}{$info} .= $stLine;
    }
}
dump%data;

__DATA__
>gi|84341511|gb|DU991381.1| KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence
ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGT
AAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGG
AGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTAC
GAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAA
CTTCAAAGTAAACAATGTGTTGTGTC
>gi|84341510|gb|DU991380.1| KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence
GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTA
TCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCC
CAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTT
ATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF

output:

(
  "DU991381.1",
  {
    " KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence" => "ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGTAAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGGAGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTACGAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAACTTCAAAGTAAACAATGTGTTGTGTC",
  },
  "DU991380.1",
  {
    " KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence" => "GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTATCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCCCAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTTATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF",
  },
)
while (my $stLine = <DATA>)
{
    chomp($stLine);
    if ($stLine =~ /^>+/)
    {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);
        print "$stSeq\n";           # You could print here to get the whole last
                                    # seq before you start a new one.
                                    # Of course this would be the one from before
                                    # the header parsed here.

        $stSeq = "";                # Start a new sequence when you hit a header
    }
    else #This is the part I cannot figure out
    {
        $stSeq = $stSeq.$stLine;    # No my here and append to the sequence so far.
        print $stSeq,"\n";          # This will of course only print the seq so far.
                                    # Probably the above print is better.
    }
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top