Biopython (or just Python in general): Most Efficient Way to Parse Species Name From A large .fasta file using gi identifier

https://stackoverflow.com/questions/10009722

29-05-2021
|

Question

I have a .fasta file (.txt essentiallly) of about 145000 entries that are formatted as below

>gi|393182|gb|AAA40101.1| cytokine [Mus musculus]
MDAKVVAVLALVLAALCISDGKPVSLSYRCPCRFFESHIARANVKHLKILNTPNCALQIVARLKNNNRQV
CIDPKLKWIQEYLEKALNKRLKM

>gi|378792467|pdb|3UNH|Y Chain Y, Mouse 20s Immunoproteasome
TTTLAFKFQHGVIVAVDSRATAGSYISSLRMNKVIEINPYLLGTMSGCAADCQYWERLLAKECRLYYLRN
GERISVSAASKLLSNMMLQYRGMGLSMGSMICGWDKKGPGLYYVDDNGTRLSGQMFSTGSGNTYAYGVMD
SGYRQDLSPEEAYDLGRRAIAYATHRDNYSGGVVNMYHMKEDGWVKVESSDVSDLLYKYGEAAL

>gi|378792462|pdb|3UNH|T Chain T, Mouse 20s Immunoproteasome
MSSIGTGYDLSASTFSPDGRVFQVEYAMKAVENSSTAIGIRCKDGVVFGVEKLVLSKLYEEGSNKRLFNV
DRHVGMAVAGLLADARSLADIAREEASNFRSNFGYNIPLKHLADRVAMYVHAYTLYSAVRPFGCSFMLGS
YSANDGAQLYMIDPSGVSYGYWGCAIGKARQAAKTEIEKLQMKEMTCRDVVKEVAKIIYIVHDEVKDKAF
ELELSWVGELTKGRHEIVPKDIREEAEKYAKESLKEEDESDDDNM

I have a list of gi's (the first number listed after the |).
The size of this list varies between 60 - 600 gi's for a given test
I want to return a list with respective species of those gi's
The species name is usually seen as in the first example (surrounded by square brackets [Mus musculus]) it is not always present.
Order is not particularly important.

I have been using various BioPython parsing bits and pieces but I think because of the size of the search it fails. I was hoping someone on here would know of a more efficient way?

Thanks in advance!

Solution

Rather than parsing the not entirely consistent FASTA header line for the species, you could just extract the GI number and then look up the NCBI taxonomy ID, e.g. see http://lists.open-bio.org/pipermail/biopython/2009-June/005304.html - and from the taxid you can get the species name, common name, lineage etc. See ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt or if you prefer an online solution, the Entrez Utilities (EUtils) are another option.

OTHER TIPS

Without having the data to try it out on, I'd suggest the quickest way would be to load the gi you want into a set. Then read through the fasta file with the minimum amount of processing to extract the gi in the line and then if the gi is in the wanted set, extract the last | delimited field.

E.g.:

# assuming gi list is in a file, one per line
with open('lookup_list.txt','r') as f:
  wanted = set(x.strip() for x in f)

with open('data.fasta','r') as f:
  for line in f:
    if line and line[0] == '>':
      gi = line[4:line.find('|',4)]
      if gi in wanted:
        text = line[line.rfind('|')+1:] # Now process the text to extract species
        print text

How you extract the species name from the description field will depend on the various formats you find.

Really, really naive approach

with open('my.fasta') as fd:
    for line in fd:
        if line.startswith('>'):
            if '[' in line:
                gi=line.split('|')[1]
                name=line.split('[')[-1]
                print gi, name[:-2]

For the above, the output is:

393182 Mus musculus

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow