Parse only top 3 hits from BLAST output with NCBIXML

https://stackoverflow.com/questions/13835912

07-12-2021

Problem

I modified the code below to parse the information I need from BLAST XML output.

import csv
from Bio.Blast import NCBIXML

E_VALUE_THRESH = 0.00000000000000001

# Open both files in a with-block so they are closed even if parsing fails.
# The 'rU' mode is no longer accepted by current Python 3 releases.
with open('PGblast.xml') as blast_handle, \
        open('PGhit.csv', 'w', newline='') as csv_handle:
    output = csv.writer(csv_handle, delimiter=',',
                        quoting=csv.QUOTE_NONNUMERIC)
    output.writerow(["Query", "Hit ID", "Hit Def", "E-Value"])

    # NCBIXML.parse yields one record per query sequence.
    for blast_record in NCBIXML.parse(blast_handle):
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                # Keep only HSPs below the E-value cutoff.
                if hsp.expect < E_VALUE_THRESH:
                    output.writerow([blast_record.query[:8],
                                     alignment.hit_id, alignment.hit_def,
                                     hsp.expect])

This code lets me parse the hits below an E-value cutoff. However, I would like to parse only the best hit, or say the top 3 hits, for each query, because the BLAST XML output file is very large.

Parsing every hit takes a long time, and I do not actually need all of them.

Could someone please help?

Solution

Parsing only the top 3 HSPs of each hit, without parsing the whole file, would require you to write your own custom XML parser. Biopython's NCBIXML does not do this.
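For illustration, a custom parser along those lines could be sketched with the standard library's xml.etree.ElementTree.iterparse. This is only a rough sketch, not part of Biopython: the function name top_hits and its max_hits parameter are made up here, it assumes the usual BLAST XML element names (Iteration, Iteration_query-def, Hit, Hit_id, Hit_def, Hsp_evalue), and it assumes hits are listed best-first within each query, as BLAST normally reports them. Note that it still reads through the whole file; what it avoids is building full record objects for every hit and holding them in memory.

```python
import xml.etree.ElementTree as ET


def top_hits(xml_source, max_hits=3):
    """Yield (query, hit_id, hit_def, best_evalue) for the first
    max_hits hits of each query in a BLAST XML file.

    xml_source may be a file name or a binary file object.
    """
    query = None
    hits_seen = 0
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start":
            # A new <Iteration> marks the start of a new query.
            if elem.tag == "Iteration":
                query, hits_seen = None, 0
            continue
        if elem.tag == "Iteration_query-def":
            query = elem.text
        elif elem.tag == "Hit":
            hits_seen += 1
            if hits_seen <= max_hits:
                # Report the lowest E-value among this hit's HSPs.
                evalues = [float(e.text) for e in elem.iter("Hsp_evalue")]
                yield (query, elem.findtext("Hit_id"),
                       elem.findtext("Hit_def"), min(evalues, default=None))
            # Drop the hit's children so memory use stays flat.
            elem.clear()
        elif elem.tag == "Iteration":
            elem.clear()
```

Each yielded tuple can be passed straight to csv.writer.writerow, for example by looping over top_hits('PGblast.xml') and applying the E-value cutoff to the last field.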

However, if it's speed improvement you're looking for, you could try the new SearchIO submodule (http://biopython.org/wiki/SearchIO). It has a new BLAST XML parser that's supposed to be faster than the old NCBIXML parser. The old parser relies on a pure-Python XML parser, while the new one in SearchIO uses cElementTree whenever possible.

The submodule is still new and experimental, so there might still be some changes before it hits an official release. If you're interested, there's also a draft tutorial here: http://bow.web.id/biopython/Tutorial.html#htoc96.
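As a rough sketch of what the SearchIO route might look like for the CSV output in the question: the helper name write_top_hits and its parameters are invented for this example, and the column layout simply mirrors the question's code. It still parses every query in the file; the slice only limits which hits are written, and it relies on hits being listed best-first. Depending on the file, qresult.id may differ from the blast_record.query string used with NCBIXML.

```python
import csv


def write_top_hits(xml_path, csv_path, max_hits=3):
    """Write the first max_hits hits of each query in a BLAST XML file to CSV."""
    # Biopython is imported here so the dependency is local to this helper.
    from Bio import SearchIO

    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle, quoting=csv.QUOTE_NONNUMERIC)
        writer.writerow(["Query", "Hit ID", "Hit Def", "E-Value"])
        # SearchIO.parse yields one QueryResult per query sequence.
        for qresult in SearchIO.parse(xml_path, "blast-xml"):
            # Keep only the first max_hits hits of this query.
            for hit in qresult.hits[:max_hits]:
                # Lowest E-value among the hit's HSPs.
                best = min((hsp.evalue for hsp in hit.hsps), default=None)
                writer.writerow([qresult.id[:8], hit.id,
                                 hit.description, best])
```

It could be called as write_top_hits('PGblast.xml', 'PGhit.csv'), with max_hits=1 to keep only the best hit per query.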

License: CC-BY-SA with attribution

Not affiliated with StackOverflow