Extract data from a GenBank (.gb) file using Biopython

https://stackoverflow.com/questions/9149439

22-04-2021

Question

I have a GenBank (.gb) file and need to extract specific features from it: the names and sizes of the protein-coding genes.

LOCUS       NC_008137              15318 bp    DNA     linear   MAM 15-APR-2009
DEFINITION  Phalanger interpositus mitochondrion, complete genome.
ACCESSION   NC_008137
VERSION     NC_008137.1  GI:108793518
DBLINK      Project: 17043
KEYWORDS    .
SOURCE      mitochondrion Phalanger interpositus (Stein's cuscus)
  ORGANISM  Phalanger interpositus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Metatheria; Diprotodontia; Phalangeridae; Phalanger.
REFERENCE   1  (bases 1 to 15318)
  AUTHORS   Munemasa,M., Nikaido,M., Donnellan,S., Austin,C.C., Okada,N. and
            Hasegawa,M.
  TITLE     Phylogenetic analysis of diprotodontian marsupials based on
            complete mitochondrial genomes
  JOURNAL   Genes Genet. Syst. 81 (3), 181-191 (2006)
   PUBMED   16905872
REFERENCE   2  (bases 1 to 15318)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (12-JUN-2006) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 15318)
  AUTHORS   Munemasa,M., Nikaido,M., Donnellan,S., Austin,C.C., Okada,N. and
            Hasegawa,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-NOV-2005) Tokyo Institute of Technology, Graduate
            School of Bioscience and Biotechnology; Nagatsuta-cho 4259-B-21,
            Midori-ku, Kanagawa 226-8501, Japan
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AB241057.
            Genome sequence lacks part of non-coding region.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..15318
                     /organism="Phalanger interpositus"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:356347"
                     /tissue_type="liver"
                     /common="Stein's cuscus"
     tRNA            1..69
                     /product="tRNA-Phe"
     rRNA            72..1018
                     /product="s-rRNA"
                     /note="12S ribosomal RNA"
     tRNA            1020..1088
                     /product="tRNA-Val"
     rRNA            1089..2653
                     /product="l-rRNA"
                     /note="16S ribosomal RNA"
     tRNA            2654..2727
                     /product="tRNA-Leu"
                     /codon_recognized="UUR"
     gene            2729..3685
                     /gene="ND1"
                     /db_xref="GeneID:4117948"
     CDS             2729..3685
                     /gene="ND1"
                     /codon_start=1
                     /transl_table=2
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="YP_637062.1"
                     /db_xref="GI:108793519"
                     /db_xref="GeneID:4117948"
                     /translation="MFIINLLMYIIPILLAIAFLTLVERKALGYMQFRKGPNVVGPYG
                     LLQPIADGMKLFSKEPLQPVTSSTTMFIIAPTLALTLSLTMWTPLPMPHSLIDLNLGL
                     LFILALSGLSVYSILWSGWASNSKYALMGALRAVAQTISYEVTLAIILLSIMLINGSF
                     TLKNLITTQENMWLIITTWPLVMMWYVSTLAETNRAPLDLTEGESELVSGFNVEYAAG
                     PFAMFFLAEYANIMLMNAMTTILFLGSSINHNFTHLNTLSFMTKTIALTFLFLWVRAS
                     YPRFRYDQLMHLLWKNFLPMTLAMCLWFISIPIALSCIPPQI"
     misc_feature    2729..3682
                     /gene="ND1"
                     /note="NADH dehydrogenase; Region: NADHdh; cl00469"
                     /db_xref="CDD:186018"
     tRNA            3686..3751
                     /product="tRNA-Ile"
     tRNA            complement(3750..3821)
                     /product="tRNA-Gln"
     tRNA            3821..3878
                     /product="tRNA-Met"
     gene            3889..4932
                     /gene="ND2"
                     /db_xref="GeneID:4117949"
     CDS             3889..4932
                     /gene="ND2"
                     /codon_start=1
                     /transl_table=2
                     /product="NADH dehydrogenase subunit 2"
                     /protein_id="YP_637063.1"
                     /db_xref="GI:108793520"
                     /db_xref="GeneID:4117949"
                     /translation="MSPYILLIMLTSLLLGTSLTLFSNHWLTAWMGLEINTLAIIPMM
                     TYPNHPRATESAIKYFLTQSTASMMLMFAIINNAWMTNQWTLLQTSDQTSSTIMTLAL
                     AMKLGLAPFHFWVPEVTQGIPLTSGMILLTWQKIAPTSLMYQISPSLNMKILVMLALL
                     STILGGWGGLNQTHMRKILAYSSIAHMGWMTIIILINPTLTLLNLAIYITTTLTLFLA
                     LNHSSITKIKSLANLWNKSSSMTIVIALTLLSLGGLPPLTGFMPKWLILQELITYNNI
                     ATATMMAMSALLNLFFYMRIIYTTTLTMPPSINNSKLQWPHPQTKTTNIIPLLTIISS
                     FLLPLTPLSITLS"

I tried using SeqFeature and its subfeatures, but that did not work.

From this file I should get ND1 with 2729..3685, ND2 with 3889..4932, and so on for any further genes.

I'm new to Biopython and would appreciate help with how to do this.

Solution

The GenBank file you posted is incomplete: some sections are missing, and it lacks the // termination line, so parsers get stuck trying to read it.
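
A quick way to confirm this kind of truncation before parsing is to check whether the text ends with the // terminator. The following is a minimal sketch using only the standard library; the function name has_genbank_terminator and the two sample strings are hypothetical, and the check only looks at the last non-blank line, so it says nothing about other missing sections.

```python
def has_genbank_terminator(text):
    """Return True if the last non-blank line of the text is the '//' terminator."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "//"


# Hypothetical, heavily shortened samples: one cut off like the posted file,
# one that ends with a sequence section and the terminator line.
truncated = "LOCUS       NC_008137\nFEATURES             Location/Qualifiers\n"
complete = truncated + "ORIGIN\n        1 gatc\n//\n"

print(has_genbank_terminator(truncated))
print(has_genbank_terminator(complete))
```

For a file on disk, the same function could be applied to the result of reading the file as text.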

I got the complete file for the Phalanger interpositus mitochondrion from here.
Then (Python 3 code):

>>> from Bio import SeqIO
>>> arch = "C:/code/NC_008137.gbk"
>>> record = SeqIO.parse(arch, "genbank")
>>> rec = next(record)                       # there is only one record
>>> for f in rec.features:
...     if f.type == 'gene':
...         print(f.qualifiers['gene'], f.location)
...
['ND1'] [2728:3685]
['ND2'] [3888:4932]
['COX1'] [5365:6919]
['COX2'] [7052:7737]
['ATP8'] [7798:8005]
['ATP6'] [7959:8640]
['COX3'] [8639:9423]
['ND3'] [9488:9837]
['ND4L'] [9906:10203]
['ND4'] [10196:11574]
['ND5'] [11773:13582]
['ND6'] [13578:14082]
['CYTB'] [14155:15301]
>>>
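
Note that Biopython reports locations with a 0-based start and an exclusive end, so [2728:3685] corresponds to 2729..3685 in the GenBank file. Since the question asks for protein-coding genes and their sizes, a possible follow-up is to filter on CDS features and compute the length from the location. The sketch below is one way to do that, not part of the original answer: the function cds_names_and_sizes and the stand_in helper are hypothetical names, and the stand-in objects only mimic the attributes of a SeqFeature so that the example runs without a file on disk.

```python
from types import SimpleNamespace


def cds_names_and_sizes(features):
    """Return (gene, 1-based start, end, size in bp) for each CDS feature.

    Only .type, .qualifiers, .location.start and .location.end are used, so
    Biopython SeqFeature objects (e.g. rec.features) should work as input.
    """
    rows = []
    for feature in features:
        if feature.type != "CDS":  # CDS features mark protein-coding regions
            continue
        # Qualifier values are lists; "unknown" is a placeholder if /gene is absent.
        name = feature.qualifiers.get("gene", ["unknown"])[0]
        start = int(feature.location.start)  # 0-based, as Biopython reports it
        end = int(feature.location.end)
        # Assumes a contiguous location; a spliced (join) feature would need
        # len(feature.location) instead of end - start.
        rows.append((name, start + 1, end, end - start))
    return rows


def stand_in(kind, gene, start, end):
    """Hypothetical stand-in for a SeqFeature so the sketch runs without a file."""
    location = SimpleNamespace(start=start, end=end)
    return SimpleNamespace(type=kind, qualifiers={"gene": [gene]}, location=location)


# Coordinates taken from the record in the question, in Biopython's 0-based form.
demo_features = [
    stand_in("gene", "ND1", 2728, 3685),
    stand_in("CDS", "ND1", 2728, 3685),
    stand_in("gene", "ND2", 3888, 4932),
    stand_in("CDS", "ND2", 3888, 4932),
]

for name, start, end, size in cds_names_and_sizes(demo_features):
    print(f"{name}\t{start}..{end}\t{size} bp")
```

With the stand-in data this prints ND1 as 2729..3685 (957 bp) and ND2 as 3889..4932 (1044 bp). With a record parsed as in the session above, the call would be cds_names_and_sizes(rec.features).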

Licensed under: CC-BY-SA with attribution