Question

I have a huge BLAST output file in tabular format. I want to sort the data by protein name, to see which sequences align to each protein. Let's say I have:

con19 sp|Q24K02|IDE_BOVIN 3
con19 sp|P35559|IDE_RAT   2
con15 sp|Q24K02|IDE_BOVIN 8
con15 sp|P14735|IDE_HUMAN 30
con16 sp|Q24K02|IDE_BOVIN 45
con16 sp|P35559|IDE_RAT   23

I want to get output in either of the following forms; both are OK:

sp|Q24K02|IDE_BOVIN con19 3            sp|Q24K02|IDE_BOVIN con19 3
                    con15 8            sp|Q24K02|IDE_BOVIN con15 8
                    con16 45           sp|Q24K02|IDE_BOVIN con16 45
sp|P35559|IDE_RAT   con19 2            sp|P35559|IDE_RAT   con19 2          
                    con16 23           sp|P35559|IDE_RAT   con16 23
sp|P14735|IDE_HUMAN con15 30           sp|P14735|IDE_HUMAN con15 30

This is what I tried:

f1 = open('file.txt','r')
lines=f1.readlines()
for line in lines:
    a=sorted(lines)
    r=open('file.txt','w')
    r.writelines(a)
f1.close       

Solution

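The problem is that you are calling sorted inside the loop, so the whole list is re-sorted and the file rewritten once for every line. In addition, sorted is called without a key, so the lines are ordered by their first column rather than by protein. Sort once, with a key that extracts the protein identifier: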

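# Read every line first, so the file can safely be reopened for writing.
with open('file.txt', 'r') as f1:
    lines = f1.readlines()

# Sort once, keyed on the accession between the pipes (e.g. Q24K02).
a = sorted(lines, key=lambda l: l.split('|')[1])

with open('file.txt', 'w') as r:
    r.writelines(a)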
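
To see what the key function does, here is a minimal sketch that runs on the sample rows from the question, held in a list instead of a file. The helper name accession is hypothetical and only there for readability. Because sorted is stable, rows for the same protein keep their original relative order.

```python
# Sample rows copied from the question; in practice they would come from the file.
rows = [
    "con19 sp|Q24K02|IDE_BOVIN 3\n",
    "con19 sp|P35559|IDE_RAT   2\n",
    "con15 sp|Q24K02|IDE_BOVIN 8\n",
    "con15 sp|P14735|IDE_HUMAN 30\n",
    "con16 sp|Q24K02|IDE_BOVIN 45\n",
    "con16 sp|P35559|IDE_RAT   23\n",
]


def accession(line):
    """Return the text between the first two '|' characters, e.g. 'Q24K02'."""
    return line.split('|')[1]


# One sort over the whole list, keyed on the accession.
sorted_rows = sorted(rows, key=accession)
print(''.join(sorted_rows), end='')
```

Note that this orders the groups alphabetically by accession, so the groups may appear in a different order than in the input.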

OTHER TIPS

You need to sort on the middle element. Sorting the lines themselves gives a plain alphabetical sort, which is effectively a sort on the first element. Try this instead:

with open('infile.txt') as f_in, open('outfile.txt', 'w') as f_out:
    f_out.write(''.join(sorted(f_in, key=lambda x: x.split()[1:2])))
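
Sorting alone keeps the original column order. If the protein name should come first, as in the desired output in the question, a rough sketch along the following lines might do it. It assumes three whitespace-separated columns (query, protein, score) exactly as in the sample, and that the data fits in memory. The function names group_by_protein and format_groups are hypothetical.

```python
def group_by_protein(lines):
    """Map each protein name to its (query, score) pairs, in first-seen order."""
    groups = {}  # dicts keep insertion order in Python 3.7+
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        query, protein, score = fields[0], fields[1], fields[2]
        groups.setdefault(protein, []).append((query, score))
    return groups


def format_groups(groups, repeat_name=True):
    """Return output lines with the protein name in the first column.

    With repeat_name=False the name is printed only on the first row of
    each group and left blank on the following rows.
    """
    width = max((len(p) for p in groups), default=0)
    out = []
    for protein, hits in groups.items():
        for i, (query, score) in enumerate(hits):
            name = protein if (repeat_name or i == 0) else ''
            out.append(f"{name:<{width}} {query} {score}")
    return out


# Sample rows from the question, used here instead of a real file.
sample = [
    "con19 sp|Q24K02|IDE_BOVIN 3",
    "con19 sp|P35559|IDE_RAT   2",
    "con15 sp|Q24K02|IDE_BOVIN 8",
    "con15 sp|P14735|IDE_HUMAN 30",
    "con16 sp|Q24K02|IDE_BOVIN 45",
    "con16 sp|P35559|IDE_RAT   23",
]
print('\n'.join(format_groups(group_by_protein(sample), repeat_name=False)))
```

Passing an open file object instead of sample should work the same way, since both yield one line at a time.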
Licensed under: CC-BY-SA with attribution