Question

I have BLAST output in the default format. I want to parse it and extract only the information I need using a regex. However, in the line below

Query= contig1

there is a space between '=' and 'contig1', so my output is printed with a leading space. How can I avoid this? Below is a piece of my code:

import re
output = open('out.txt','w')
with open('in','r') as f:
    for line in f:
        if re.search('Query=\s', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('Query=\s')
            line = line.rstrip('\s/')
            query = line
            print >> output,query
output.close()

The output should look like this:

contig1

Solution

You can use the match object returned by re.search to extract the value you want:

for line in f:
    match = re.search(r'Query=\s*(.*)', line)
    if match is not None:
        query = match.group(1)
        print >> output,query

Here we search for Query= followed by optional whitespace and capture the rest of the line in a group. Because the regular expression has only one group, match.group(1) returns exactly that captured text. The '.' does not match a newline, so the trailing line break is not part of the result.
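As a minimal sketch of the same idea in Python 3 syntax, the capture can be wrapped in a small function. The name extract_query and the sample lines are made up for illustration and are not part of the original answer. The pattern is anchored at the start of the line, and .strip() removes any trailing whitespace the group picked up:

```python
import re

# Hypothetical helper for illustration; not from the original answer.
# 'Query=' at the start of the line, optional whitespace, then the rest.
QUERY_RE = re.compile(r'Query=\s*(.*)')

def extract_query(line):
    """Return the text after 'Query=' without surrounding whitespace, or None."""
    match = QUERY_RE.match(line)  # match() only matches at the start of the string
    if match is None:
        return None
    return match.group(1).strip()

print(extract_query('Query= contig1\n'))  # contig1
print(extract_query('Length=1500\n'))     # None
```

Returning None for non-matching lines lets the caller decide whether to skip them or report them.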

Depending on the nature of your data, simple string prefix matching may be enough, as in the following example:

output = open('out.txt','w')
with open('in.txt','r') as f:
    for line in f:
        if line.startswith('Query='):
            query = line.replace('Query=', '').strip()
            print >> output,query
output.close()

In this case you don't need the re module at all.
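A sketch of the prefix approach in Python 3 syntax, assuming the input is any iterable of lines. The function name query_names and the sample lines are illustrative only and are not real BLAST output. Slicing off the prefix, rather than calling replace, leaves any later occurrence of Query= in the same line untouched:

```python
PREFIX = 'Query='

def query_names(lines):
    """Yield the text after 'Query=' for every line that starts with it."""
    for line in lines:
        if line.startswith(PREFIX):
            # Drop the prefix, then strip the leading space and the newline.
            yield line[len(PREFIX):].strip()

# Made-up sample lines standing in for a BLAST report.
sample = ['Query= contig1\n', 'Length=1500\n', 'Query= contig2\n']
for name in query_names(sample):
    print(name)
```

An open file object can be passed as lines directly, since iterating over a file yields one line at a time.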

OTHER TIPS

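If you are just looking for lines like tag=value, do you need a regex at all? str.partition splits on the first '=' and does not fail on lines that contain no '=' or more than one:

tag, sep, value = line.partition('=')
if tag == 'Query':
    print value.strip()

Alternatively, split on the prefix itself, join what remains, and strip the leftover space:

a = 'Query= conguie'

print "".join(a.split('Query=')).strip()

# output: conguie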

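In Python 2, a comma between the items of a print statement inserts a space between them. For example,

print 'Query:',query

prints a space after the colon. To avoid it, format the items into a single string:

print "%s%s"%('Query:',query)

In print >> output,query, however, the comma only separates the target file from the value being printed, so it is not what adds the leading space.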

Licensed under: CC-BY-SA with attribution