Bioperl push $seq->id to array

https://stackoverflow.com/questions/19895789

30-07-2022
|

문제

I'm fairly new to Perl and Bioperl, I am trying to write a script which will identify instances of identical sequences. To achieve this, I envision a script which takes 2 infiles, the first a multiple alignment in fasta format and the second an accessory file which links the fasta ids to other pertinent information. My approach is to read through the multiple alignment with Bio::SeqIO and to place the file contents in a hash where the sequence is the key and the id is the value, or an array of ids is the value in the case of sequence sharing.

I think it should look something like this:

"AATTTGTTGTTGTACC" => ('Seq1', 'Seq13'),

"TTTCTCTTTCCCAAAG" => 'Seq2',

At the moment, I believe that I am stuck because of an error in trying to push the second id onto the array in the case of sequence sharing (i.e. 'Seq13' in the above example).

Here is the test multiple alignment I'm working with:

    >Seq1
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    >Seq2
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
    >Seq13
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

And below the code I've written so far:

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Data::Dumper;

my $seqs = shift @ARGV or die "please provide a multiple alignment file and an accesory information file: $!\n";
my $info = shift @ARGV or die "please provide a multiple alignment file and an accesory information file: $!\n";

#open(INFO, '<', $info);

my $inseq = Bio::SeqIO->new(
    -file => $seqs,
    -format => "fasta",
    );

my %hts;

while (my $seq = $inseq->next_seq) {
#    print $seq->seq(), "\t", $seq->id, "\n";
    if (defined $hts{$seq->seq()}) {
    print "Sequence already in hash:\t$seq->id\n";
    push @{$hts{$seq->seq()}}, ${$seq->id};
    }
    else {
    $hts{$seq->seq()} = $seq->id;
    }
    print Dumper \%hts
}

And so here is what I would appreciate some assistance with

1) I'm getting an error that I don't quite understand, but believe relates to the push statement --> Can't use string ("Seq1") as an ARRAY ref while "strict refs" in use at ht_sharing.pl line 24, line 3.

2)When the print statement outside of the if loop is active it prints the id as I believe it should (i.e. Seq1) but in the print statement inside the if loop the same call $seq->id instead produces a reference (i.e. Bio::Seq=HASH(0x19e7210)->id). Why is this? I don't understand why printing $seq->id has different outputs within the same while loop.

I'd be really grateful if anyone can offer clarification and, of course, as someone still quite new to this comments on best practice or a better way to approach the problem are also great.

Cheers, Ana

해결책

Your code is pretty close but there are couple of minor issues. The first one is that you want to use the syntax if (exists $hash{$key}) { ... } to see that a key exists, defined tells you whether the value is defined. The second thing is that you are dereferencing your $seq object for no reason.

When you call the method 'next_seq' on a Bio::SeqIO object, it returns a Bio::Seq object. If you call the 'id' method on that Bio::Seq object, it returns the ID as expected, so there is not need to deference anything. Also, there is no need to import Bio::Seq explicitly (that's just a comment, not a problem).

Other comments:

Try to put your print Dumper %hts; call after the while (my $seq ...) loop (i.e., after you have gone through all of your seq objects). Dumping the hash as you are going through the file is not very informative here.
Do you really need to keep the ID associated with the sequence? If you just want to check for duplicate sequences, try $hts{$seq->seq}++ in your loop, and take a peek at the sorted values to see if you have duplicates. That will be faster.
This may be an exercise, but consider the size of your data. If you are attempting to do this on hundreds of millions of sequences (very common these days), you may wait a long time and then run out of memory doing this. I mention that because there are other ways to do this, using clustering methods, but I don't want to dissuade you for trying out a BioPerl solution.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow