Parsing BLAST XML output with re

https://stackoverflow.com/questions/21314423

01-10-2022

Question

I'm trying to parse BLAST output in XML format using re. I have never done this before; my code is below.

However, since some hits contain more than one Hsp_num, I get more values for query_from and query_to than for query_len. How can I make the script print query_len again for every Hsp_num after the first? Thank you.

import re
output = open('result.txt','w')
n = 0
with open('file.xml','r') as xml:
    for line in xml:
         if re.search('<Hsp_query-from>', line) != None:
             line = line.strip()
             line = line.rstrip()
             line = line.strip('<Hsp_query-from>')
             line = line.rstrip('</')
             query_from = line
         if re.search('<Hsp_query-to>', line) != None:
             line = line.strip()
             line = line.rstrip()
             line = line.strip('<Hsp_query-to>')
             line = line.rstrip('</')
             query_to = line
         if re.search('<Hsp_num>', line) != None:
             line = line.strip()
             line = line.rstrip()
             line = line.strip('<Hsp_num>')
             line = line.rstrip('</')
             Hsp_num = line
             print >> output, Hsp_num+'\t'+query_from+'\t'+query_to
output.close()

I extracted query_len in a separate script, since it didn't work in the first one:

with open('file.xml','r') as xml:
    for line in xml:
        if re.search('<Iteration_query-len>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Iteration_query-len>')
            line = line.rstrip('</')
            query_len = line

Solution

Are you familiar with Biopython? Its Bio.Blast.NCBIXML module may be just what you need. Chapter 7 of the Tutorial and Cookbook is all about BLAST, and section 7.3 deals with parsing. It will give you an idea of how the parser works, and it will be a lot easier than using regex to parse XML, which will only lead to tears and mental breakdowns.
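To illustrate why a real XML parser makes the query_len problem go away, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree rather than Biopython, so it is a swapped-in alternative and not the Bio.Blast.NCBIXML API. The tag names come from the question; the function name hsp_rows and the tiny sample document are hypothetical and only there to make the example runnable. Because each Hsp is nested inside its Iteration, the query length can simply be looked up once per Iteration and repeated for every Hsp below it.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the BLAST XML layout (illustrative, not real output).
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-len>120</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_query-from>5</Hsp_query-from>
              <Hsp_query-to>60</Hsp_query-to>
            </Hsp>
            <Hsp>
              <Hsp_num>2</Hsp_num>
              <Hsp_query-from>70</Hsp_query-from>
              <Hsp_query-to>118</Hsp_query-to>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def hsp_rows(source):
    """Yield (hsp_num, query_from, query_to, query_len) for every HSP.

    `source` is a filename or a file-like object holding BLAST XML.
    Missing fields come back as empty strings.
    """
    tree = ET.parse(source)
    for iteration in tree.iter("Iteration"):
        # One query length per Iteration; reused for all of its HSPs.
        query_len = iteration.findtext("Iteration_query-len", default="")
        for hsp in iteration.iter("Hsp"):
            yield (
                hsp.findtext("Hsp_num", default=""),
                hsp.findtext("Hsp_query-from", default=""),
                hsp.findtext("Hsp_query-to", default=""),
                query_len,
            )


rows = list(hsp_rows(io.StringIO(SAMPLE)))
for row in rows:
    print("\t".join(row))
```

To run it on a real file, pass the filename instead of the StringIO object, for example hsp_rows('file.xml'). This version loads the whole document into memory, which may be a poor fit for very large BLAST reports.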

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow