How do I get gene features in FASTA nucleotide format from NCBI using Perl?

https://stackoverflow.com//questions/22073986

23-12-2019
|

Question

I am able to download a FASTA file manually that looks like:

>lcl|CR543861.1_gene_1...
ATGCTTTGGACA...
>lcl|CR543861.1_gene_2...
GTGCGACTAAAA...

by clicking "Send to" and selecting "Gene Features", FASTA Nucleotide is the only option (which is fine because that's all I want) on this page.

With a script like this:

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::DB::EUtilities;

my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                       -db      => 'nucleotide',
                                       -id      => 'CR543861',
                                       -rettype => 'fasta');
my $file = 'CR543861.fasta';
$factory->get_Response(-file => $file);

I get a file that looks like:

>gi|49529273|emb|CR543861.1| Acinetobacter sp. ADP1 complete genome
GATATTTTATCCACA...

with the whole genomic sequence lumped together. How do I get information like in the first (manually downloaded) file?

I looked at a couple of other posts:

As well as this section from EUtilities Cookbook.

I tried fetching and saving a GenBank file (since it seems to have separate sequences for each gene in the .gb file I get), but when I go work with it using Bio::SeqIO, I will get only 1 large sequence.

Solution

With that accession number and return type, you are getting the complete genome sequence. If you want to get the individual gene sequences, specify that you want the complete genbank file, then parse out the genes. Here is an example:

#!/usr/bin/env perl

use 5.010;
use strict;
use warnings;
use Bio::SeqIO;
use Bio::DB::EUtilities;


my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                       -email   => 'foo@bar.com',
                                       -db      => 'nucleotide',
                                       -id      => 'CR543861',
                                       -rettype => 'gb');
my $file = 'CR543861.gb';
$factory->get_Response(-file => $file);

my @gene_features = grep { $_->primary_tag eq 'gene' } 
                    Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;

for my $feat_object (@gene_features) {
    for my $tag ($feat_object->get_all_tags) {
        # open a filehandle here for writing each to a separate file
        say ">",$feat_object->get_tag_values($tag);
        say $feat_object->spliced_seq->seq;
        # close it!
    } 
}

This will write each gene to the same file (if you redirect it, now it just writes to STDOUT) but I indicated where you could make a small change to write them to separate files. Parsing genbank can be a bit tricky at times, so it is always helpful to read the docs and in particular, the excellent Feature Annotation HOWTO.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow