Extract data from a GenBank (.gb) file using Biopython
Question
I have a GenBank (.gb) file and I need to extract some specific features from it: the names and sizes of the protein-coding genes.
LOCUS       NC_008137              15318 bp    DNA     linear   MAM 15-APR-2009
DEFINITION  Phalanger interpositus mitochondrion, complete genome.
ACCESSION   NC_008137
VERSION     NC_008137.1  GI:108793518
DBLINK      Project: 17043
KEYWORDS    .
SOURCE      mitochondrion Phalanger interpositus (Stein's cuscus)
  ORGANISM  Phalanger interpositus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Metatheria; Diprotodontia; Phalangeridae; Phalanger.
REFERENCE   1  (bases 1 to 15318)
  AUTHORS   Munemasa,M., Nikaido,M., Donnellan,S., Austin,C.C., Okada,N. and
            Hasegawa,M.
  TITLE     Phylogenetic analysis of diprotodontian marsupials based on
            complete mitochondrial genomes
  JOURNAL   Genes Genet. Syst. 81 (3), 181-191 (2006)
   PUBMED   16905872
REFERENCE   2  (bases 1 to 15318)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (12-JUN-2006) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 15318)
  AUTHORS   Munemasa,M., Nikaido,M., Donnellan,S., Austin,C.C., Okada,N. and
            Hasegawa,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-NOV-2005) Tokyo Institute of Technology, Graduate
            School of Bioscience and Biotechnology; Nagatsuta-cho 4259-B-21,
            Midori-ku, Kanagawa 226-8501, Japan
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AB241057.
            Genome sequence lacks part of non-coding region.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..15318
                     /organism="Phalanger interpositus"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:356347"
                     /tissue_type="liver"
                     /common="Stein's cuscus"
     tRNA            1..69
                     /product="tRNA-Phe"
     rRNA            72..1018
                     /product="s-rRNA"
                     /note="12S ribosomal RNA"
     tRNA            1020..1088
                     /product="tRNA-Val"
     rRNA            1089..2653
                     /product="l-rRNA"
                     /note="16S ribosomal RNA"
     tRNA            2654..2727
                     /product="tRNA-Leu"
                     /codon_recognized="UUR"
     gene            2729..3685
                     /gene="ND1"
                     /db_xref="GeneID:4117948"
     CDS             2729..3685
                     /gene="ND1"
                     /codon_start=1
                     /transl_table=2
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="YP_637062.1"
                     /db_xref="GI:108793519"
                     /db_xref="GeneID:4117948"
                     /translation="MFIINLLMYIIPILLAIAFLTLVERKALGYMQFRKGPNVVGPYG
                     LLQPIADGMKLFSKEPLQPVTSSTTMFIIAPTLALTLSLTMWTPLPMPHSLIDLNLGL
                     LFILALSGLSVYSILWSGWASNSKYALMGALRAVAQTISYEVTLAIILLSIMLINGSF
                     TLKNLITTQENMWLIITTWPLVMMWYVSTLAETNRAPLDLTEGESELVSGFNVEYAAG
                     PFAMFFLAEYANIMLMNAMTTILFLGSSINHNFTHLNTLSFMTKTIALTFLFLWVRAS
                     YPRFRYDQLMHLLWKNFLPMTLAMCLWFISIPIALSCIPPQI"
     misc_feature    2729..3682
                     /gene="ND1"
                     /note="NADH dehydrogenase; Region: NADHdh; cl00469"
                     /db_xref="CDD:186018"
     tRNA            3686..3751
                     /product="tRNA-Ile"
     tRNA            complement(3750..3821)
                     /product="tRNA-Gln"
     tRNA            3821..3878
                     /product="tRNA-Met"
     gene            3889..4932
                     /gene="ND2"
                     /db_xref="GeneID:4117949"
     CDS             3889..4932
                     /gene="ND2"
                     /codon_start=1
                     /transl_table=2
                     /product="NADH dehydrogenase subunit 2"
                     /protein_id="YP_637063.1"
                     /db_xref="GI:108793520"
                     /db_xref="GeneID:4117949"
                     /translation="MSPYILLIMLTSLLLGTSLTLFSNHWLTAWMGLEINTLAIIPMM
                     TYPNHPRATESAIKYFLTQSTASMMLMFAIINNAWMTNQWTLLQTSDQTSSTIMTLAL
                     AMKLGLAPFHFWVPEVTQGIPLTSGMILLTWQKIAPTSLMYQISPSLNMKILVMLALL
                     STILGGWGGLNQTHMRKILAYSSIAHMGWMTIIILINPTLTLLNLAIYITTTLTLFLA
                     LNHSSITKIKSLANLWNKSSSMTIVIALTLLSLGGLPPLTGFMPKWLILQELITYNNI
                     ATATMMAMSALLNLFFYMRIIYTTTLTMPPSINNSKLQWPHPQTKTTNIIPLLTIISS
                     FLLPLTPLSITLS"
I tried using SeqFeature and its subfeatures, but it did not work.
From this file I should get ND1 with 2729..3685, ND2 with 3889..4932, and so on for any further genes.
I'm new to Biopython and would like help with how to do this.
Solution
The GenBank file you posted is incomplete: some sections are missing and it lacks the // termination line, so parsers get stuck trying to read it.
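As a quick sanity check before parsing, something like the following could tell you whether a pasted record is truncated. This is only a sketch: the helper name ends_with_terminator and the two toy strings are made up for illustration, and it checks nothing beyond the final // line.

```python
def ends_with_terminator(text):
    """Return True if the last non-blank line of the text is '//'."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return bool(lines) and lines[-1] == "//"


# Toy strings for illustration only, not real GenBank records.
truncated = "LOCUS       TOY\nFEATURES             Location/Qualifiers\n"
complete = truncated + "ORIGIN\n        1 acgt\n//\n"

print(ends_with_terminator(truncated))  # record cut off before the terminator
print(ends_with_terminator(complete))   # record closed by '//'
```

For a file on disk you would pass in the result of open(path).read().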
I got the complete record for the Phalanger interpositus mitochondrion (accession NC_008137) instead.
Then (Python 3 code):
>>> from Bio import SeqIO
>>> arch = "C:/code/NC_008137.gbk"
>>> record = SeqIO.parse(arch, "genbank")
>>> rec = next(record)  # there is only one record
>>> for f in rec.features:
...     if f.type == 'gene':
...         print(f.qualifiers['gene'], f.location)
...
['ND1'] [2728:3685]
['ND2'] [3888:4932]
['COX1'] [5365:6919]
['COX2'] [7052:7737]
['ATP8'] [7798:8005]
['ATP6'] [7959:8640]
['COX3'] [8639:9423]
['ND3'] [9488:9837]
['ND4L'] [9906:10203]
['ND4'] [10196:11574]
['ND5'] [11773:13582]
['ND6'] [13578:14082]
['CYTB'] [14155:15301]
>>>
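Note that the printed locations start one lower than the coordinates in the file (2728 instead of 2729): Biopython stores positions 0-based with an exclusive end, so the size is simply end minus start. Since the question also asked for sizes, here is a rough sketch of how they could be tabulated. The function gene_sizes and the fake_feature stand-ins are hypothetical names introduced for this example; the stand-ins only mimic the type, qualifiers and location.start/location.end attributes used above so the snippet runs without a GenBank file. It assumes simple (non-joined) locations, as in this mitochondrial record, and filters on CDS to pick protein-coding genes.

```python
from types import SimpleNamespace


def gene_sizes(features, feature_type="CDS"):
    """Return (name, start_1based, end, size_bp) for each matching feature."""
    rows = []
    for f in features:
        if f.type != feature_type:
            continue
        # Qualifier values are lists, e.g. {'gene': ['ND1']}.
        name = f.qualifiers.get("gene", ["?"])[0]
        start = int(f.location.start)  # 0-based, inclusive
        end = int(f.location.end)      # exclusive
        # start + 1 recovers the 1-based coordinate shown in the file.
        rows.append((name, start + 1, end, end - start))
    return rows


def fake_feature(ftype, name, start, end):
    """Hypothetical stand-in for a feature object, for demonstration only."""
    return SimpleNamespace(
        type=ftype,
        qualifiers={"gene": [name]},
        location=SimpleNamespace(start=start, end=end),
    )


# Coordinates copied from the output above.
demo = [
    fake_feature("gene", "ND1", 2728, 3685),
    fake_feature("CDS", "ND1", 2728, 3685),
    fake_feature("CDS", "ND2", 3888, 4932),
]

for name, start, end, size in gene_sizes(demo):
    print(f"{name}\t{start}..{end}\t{size} bp")
```

With the real record you would call gene_sizes(rec.features) instead of passing the stand-ins.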
Licensed under: CC-BY-SA with attribution