FASTA file description parsing using Biopython

https://stackoverflow.com/questions/7441376

20-01-2021

Question

I have a FASTA file (the first sequence is shown below) with long description lines, and I need to pick out specific description fields. When I use the following code, the whole description comes back as a single string.

from Bio import SeqIO

for record in SeqIO.parse("geneTemp.fasta", "fasta"):
    id = record.id
    desc = record.description  # the whole header line, as one string

print(desc)

Is there an easy way (using Biopython libraries) to get the description fields into an array and pick specific fields, without taking the description as a string and splitting it?

Code output

Python 2.7 (r27:82500, Sep 16 2010, 18:03:06) 
[GCC 4.5.1 20100907 (Red Hat 4.5.1-3)] on localhost.localdomain, Standard
>>> FBgn0197520 type=gene; loc=scaffold_12855:complement(6241650..6242111); ID=FBgn0197520; name=Dvir\GJ10233; dbxref=FlyBase_Annotation_IDs:GJ10233,FlyBase:FBgn0197520,GLEANR:dvir_GLEANR_10171,EntrezGene:6632532,GB_protein:EDW59542,FlyMine:FBgn0197520,OrthoDB4.Arthropods:FBgn0242841,OrthoDB4.Arthropods:FBgn0213090,OrthoDB4.Arthropods:FBgn0190974,OrthoDB4.Arthropods:FBgn0165423,OrthoDB4.Arthropods:FBgn0247590,OrthoDB4.Arthropods:FBgn0149779,OrthoDB4.Arthropods:FBgn0146205,OrthoDB4.Arthropods:FBgn0017456,OrthoDB4.Arthropods:FBgn0126736,OrthoDB4.Arthropods:FBgn0117264,OrthoDB4.Arthropods:FBgn0094317; MD5=0b7e859d2a6eca028ffd16b964835705; length=462; release=r1.2; species=Dvir;
 loc=scaffold_12855:complement(6241650..6242111)

One of the sequences from the FASTA file:

>FBgn0207418 type=gene; loc=scaffold_12875:complement(14361770..14363857); ID=FBgn0207418; name=Dvir\GJ20278; dbxref=FlyBase_Annotation_IDs:GJ20278,FlyBase:FBgn0207418,GLEANR:dvir_GLEANR_5721,EntrezGene:6625684,GB_protein:EDW61510,FlyMine:FBgn0207418,OrthoDB4.Arthropods:NV16422,OrthoDB4.Arthropods:LH16819,OrthoDB4.Arthropods:ISCW000548,OrthoDB4.Arthropods:FBgn0239668,OrthoDB4.Arthropods:FBgn0219970,OrthoDB4.Arthropods:FBgn0181866,OrthoDB4.Arthropods:FBgn0175499,OrthoDB4.Arthropods:FBgn0080765,OrthoDB4.Arthropods:FBgn0155230,OrthoDB4.Arthropods:FBgn0141947,OrthoDB4.Arthropods:FBgn0033392,OrthoDB4.Arthropods:FBgn0127494,OrthoDB4.Arthropods:FBgn0102879,OrthoDB4.Arthropods:FBgn0090125,OrthoDB4.Arthropods:CPIJ005729,OrthoDB4.Arthropods:GB15324,OrthoDB4.Arthropods:AGAP012336,OrthoDB4.Arthropods:AAEL007395,OrthoDB4.Arthropods:PB24927,OrthoDB4.Arthropods:PHUM365660,OrthoDB4.Arthropods:GLEAN_06039; MD5=4c62b751ec045ac93306ce7c08d254f9; length=2088; release=r1.2; species=Dvir; 
ATGCGTCTGCGACGCCGCTGGCATCGGCGGATGCGGCGTACAATTGAGAA
AATCTATCGCCTTAAAATGCAATCGCGCCGCAAGTTGGTTTACTTAGCCG
TATTTGGAGCACTATGCGTAATATTCTGGCTGGCTGGACAGCAGTTGCTG
ACGACTTCGAATGGTCACTACAGTAGCTACTACGGCGAAACGCATTGTGC
GCCCATTGATGCCGTATACACCTGGGTAAATGGTTCGGATCCGGATTTTA
TTGAGTCCATTAGACGCTACGATGCCAGCTACGATCCGTCGCGCTTCGAC

Solution

The description part of a FASTA header is not standardized, so Biopython leaves it as a single string. You can parse it with a regular expression.

>>> import re
>>> desc = 'FBgn0207418 type=gene; loc=scaffold_12875:complement(14361770..14363857); ...'
>>> fields = dict(re.findall(r'(\w+)=(.*?);', desc))
>>> fields['type']
'gene'
>>> fields['length']
'2088'
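
As a minimal sketch, the same idea could be wrapped in a reusable function, assuming the "key=value;" layout of the FlyBase header shown in the question. The names parse_description and sample are hypothetical, and the sample string is a shortened copy of that header (dbxref trimmed to two entries). The pattern [^;]* is a small variation on the regex above, so that a final field with no trailing semicolon is still captured.

```python
import re

# Hypothetical helper: turn a "key=value; key=value;" description into a dict.
# Assumes values never contain ';' and keys are plain word characters.
FIELD_PATTERN = re.compile(r"(\w+)=([^;]*)")


def parse_description(description):
    """Return a dict of the key=value fields found in a FASTA description."""
    return {key: value.strip() for key, value in FIELD_PATTERN.findall(description)}


# Shortened copy of the header from the question (raw string keeps the backslash).
sample = (
    r"FBgn0207418 type=gene; loc=scaffold_12875:complement(14361770..14363857); "
    r"ID=FBgn0207418; name=Dvir\GJ20278; "
    r"dbxref=FlyBase_Annotation_IDs:GJ20278,FlyBase:FBgn0207418; "
    r"MD5=4c62b751ec045ac93306ce7c08d254f9; length=2088; release=r1.2; species=Dvir"
)

fields = parse_description(sample)
print(fields["loc"])  # scaffold_12875:complement(14361770..14363857)

# Multi-valued fields such as dbxref can be split further on commas.
dbxrefs = fields["dbxref"].split(",")
```

With Biopython, the same function could be called as parse_description(record.description) inside the SeqIO.parse loop. All values come back as strings, so a field such as length would need int(...) before any arithmetic.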

Licensed under: CC-BY-SA with attribution
