Convert GenBank Flatfiles to FASTA

https://stackoverflow.com/questions/6336853

27-10-2019
Question

I need to parse a preliminary GenBank flatfile. The sequence hasn't been published yet, so I can't look it up by accession and download a FASTA file. I'm new to bioinformatics, so could someone show me where I could find a BioPerl or Biopython script to do this myself? Thanks!

Solution

You need the Bio::SeqIO module to read or write out bioinformatics data. The SeqIO HOWTO should tell you everything you need to know, but here's a small read-a-GenBank-file script in Perl to get you started!

Other tips

Here is a Biopython solution. I will first assume your GenBank file holds a genome sequence, then give a different solution assuming it holds a gene sequence instead. It would have helped to know which of these you are dealing with.

Genome Sequence Parsing:

Read your GenBank flatfile from disk with:

from Bio import SeqIO
# SeqIO.read expects exactly one record in the file; use SeqIO.parse for several
record = SeqIO.read("yourGenbankFileDirectory/yourGenbankFile.gb", "genbank")

If you just want the raw sequence as a plain string:

# Seq.tostring() is gone from current Biopython releases; str() does the same job
rawSequence = str(record.seq)
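
If the goal is a FASTA file rather than a bare string, Biopython can do the whole conversion in one call. The following is a minimal sketch, assuming the flatfile parses cleanly as GenBank. The function name `genbank_to_fasta` and the file names in the comment are placeholders, not part of the original answer.

```python
from Bio import SeqIO


def genbank_to_fasta(gb_path, fasta_path):
    """Write every record in a GenBank flatfile to a FASTA file.

    Returns the number of records written.
    """
    # SeqIO.convert parses the input and writes the output in one call.
    # Each FASTA header is built from the record's id and description.
    return SeqIO.convert(gb_path, "genbank", fasta_path, "fasta")


# Example call (hypothetical file names):
# count = genbank_to_fasta("yourGenbankFile.gb", "yourGenbankFile.fasta")
```

`SeqIO.convert` also copes with flatfiles that hold more than one record, writing one FASTA entry per record.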

Next you probably need a name for this sequence, to serve as the ">header" line of the .fasta file. To see what names came with the GenBank .gb file:

nameSequence = record.features[0].qualifiers

This returns a dictionary of qualifiers for the first feature, which is usually the "source" feature covering the whole sequence, as annotated by the author of the GenBank file.
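
To turn those qualifiers into an actual FASTA header, one option is to copy the organism name into the record description before formatting. This is only a sketch under assumptions: the helper name `with_organism_header` is made up for illustration, and it assumes the file carries a "source" feature with an "organism" qualifier (it falls back to the existing description otherwise).

```python
from Bio.SeqRecord import SeqRecord


def with_organism_header(record):
    """Return a copy of record whose FASTA header reads '>id organism'."""
    # Look for the "source" feature by type rather than trusting features[0].
    source = next((f for f in record.features if f.type == "source"), None)
    # Qualifier values are lists of strings; take the first organism name.
    organism = source.qualifiers.get("organism", [""])[0] if source else ""
    return SeqRecord(
        record.seq,
        id=record.id,
        # Keep the original description when no organism is annotated.
        description=organism or record.description,
    )


# Example use on a record read with SeqIO.read(...):
# print(with_organism_header(record).format("fasta"))
```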

Gene Sequence Parsing:

Read your GenBank flatfile from disk with:

from Bio import SeqIO
# SeqIO.read expects exactly one record in the file; use SeqIO.parse for several
record = SeqIO.read("yourGenbankFileDirectory/yourGenbankFile.gb", "genbank")

To get a list of raw sequences, one per gene:

# extract from the Seq itself; keep only features of type "gene"
rawSequenceList = [str(gene.extract(record.seq)) for gene in record.features if gene.type == "gene"]

To get a matching list of names for each gene sequence (more precisely, a dictionary of qualifiers for each gene):

# same filter, so names line up with rawSequenceList
nameSequenceList = [gene.qualifiers for gene in record.features if gene.type == "gene"]
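
Putting the two lists together, each gene can be written out as its own FASTA entry. The sketch below is one possible way to do it, not the answerer's code: the function names are hypothetical, the header is taken from the /gene or /locus_tag qualifier when present, and the `feature_N` fallback name is an illustrative convention only.

```python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord


def gene_records(record):
    """Yield one SeqRecord per 'gene' feature of a parsed GenBank record."""
    for index, feature in enumerate(record.features):
        if feature.type != "gene":
            continue
        # Qualifier values are lists; prefer /gene, then /locus_tag.
        names = feature.qualifiers.get("gene") or feature.qualifiers.get("locus_tag")
        name = names[0] if names else f"feature_{index}"
        yield SeqRecord(
            feature.extract(record.seq),  # respects strand and joined locations
            id=name,
            description=f"{record.id} {feature.location}",
        )


def write_genes_as_fasta(gb_path, fasta_path):
    """Write all gene features of a single-record GenBank file as FASTA."""
    record = SeqIO.read(gb_path, "genbank")
    # SeqIO.write returns the number of records written.
    return SeqIO.write(gene_records(record), fasta_path, "fasta")
```

Genes annotated on the reverse strand come out reverse-complemented, because `extract` follows the feature location rather than slicing the parent sequence directly.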

License: CC-BY-SA with attribution
