Question

I have an input file with 16 columns:

con13   tr|M0VCZ1|  91.39   267 23  0   131 211 1   267 1   480 239 267 33.4    99.6
con13   tr|M8B287|  97.12   590 17  0   344 211 1   267 0   104 239 590 74.0    99.8
con15   tr|M0WV77|  92.57   148 11  0   73  516 1   148 2   248 256 148 17.3    99.3
con15   tr|C5WNQ0|  85.14   148 22  0   73  516 1   178 4   233 256 148 17.3    99.3
con15   tr|B8AQC2|  83.78   148 24  0   73  516 1   148 6   233 256 148 17.3    99.3
con18   tr|G9HXG9|  99.66   293 1   0   144 102 1   293 7   527 139 301 63.1    97.0
con18   tr|M0XCZ0|  98.29   293 5   0   144 102 1   293 2   519 139 301 63.1    97.0

I need to 1) group the lines by con and iterate inside each group (using groupby), 2) sort on line[2] from lowest to highest value, 3) check inside each group whether line[0], line[8] and line[9] are the same, and 4) where they are the same, remove the duplicates, keeping the line with the highest value in line[2], and print the results to a new .txt file, so that my output file looks like this:

con13   tr|M8B287|  97.12   590 17  0   344 211 1   267 0   104 239 590 74.0    99.8
con15   tr|M0WV77|  92.57   148 11  0   73  516 1   148 2   248 256 148 17.3    99.3
con15   tr|C5WNQ0|  85.14   148 22  0   73  516 1   178 4   233 256 148 17.3    99.3
con18   tr|G9HXG9|  99.66   293 1   0   144 102 1   293 7   527 139 301 63.1    97.0

My attempted script prints only a single con and does not sort:

from itertools import groupby
f1 = open('example.txt','r')
f2 = open('result1', 'w')
f3 = open('result2.txt','w')
for k, g in groupby(f1, key=lambda x:x.split()[0]): 
    seen = set()
    for line in g:
        hsp = tuple(line.rsplit())
if hsp[8] and hsp[9] not in seen:
    seen.add(hsp)
    f2.write(line.rstrip() + '\n') 
else:
    f3.write(line.rstrip() + '\n') 

Solution

Use the csv module to pre-split your lines and to write out formatted data again, and group on the 9th and 10th columns (indexes 8 and 9) to find similar rows:

import csv
from itertools import groupby
from operator import itemgetter

# 'rb'/'wb' are the Python 2 csv modes; on Python 3 use 'r'/'w' with newline=''
with open('example.txt','rb') as f1:
    with open('result1', 'wb') as f2, open('result2.txt','wb') as f3:
        reader = csv.reader(f1, delimiter='\t')
        writer1 = csv.writer(f2, delimiter='\t')
        writer2 = csv.writer(f3, delimiter='\t')

        for group, rows in groupby(reader, itemgetter(0)):
            # sort on columns 8 and 9, then numerically on column 2
            # (csv fields are strings, so convert column 2 to float)
            rows = sorted(rows, key=lambda r: (r[8], r[9], float(r[2])))
            for k, dupes in groupby(rows, itemgetter(8, 9)):
                # each group shares columns 8 and 9 and is in
                # ascending order of column 2, so the *last* row has
                # the highest value; everything else goes to writer2
                dupes = list(dupes)
                writer1.writerow(dupes[-1])
                writer2.writerows(dupes[:-1])

The sorted() call orders the rows of each group (all rows with the same row[0] value) on columns 8, 9 and 2, in that order.
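
One caveat worth checking: csv.reader yields every field as a string, so a plain itemgetter key compares column 2 as text. That happens to give the right order for the sample values, but it can misorder values with a different number of digits. A minimal Python sketch with made-up rows (the values are hypothetical, not taken from the input file):

```python
from operator import itemgetter

# Hypothetical rows; only column 2 matters for this illustration.
rows = [
    ["con1", "a", "99.66"],
    ["con1", "b", "100.00"],
    ["con1", "c", "9.50"],
]

# Text comparison works character by character: "100.00" < "9.50" < "99.66".
as_text = [r[1] for r in sorted(rows, key=itemgetter(2))]

# Numeric comparison: convert column 2 to float inside the key.
as_number = [r[1] for r in sorted(rows, key=lambda r: float(r[2]))]

print(as_text)    # ['b', 'c', 'a']
print(as_number)  # ['c', 'a', 'b']
```

If column 2 can ever reach 100 or drop below 10, converting it in the sort key is the safer choice.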

Because you want only the row with the highest column 2 value (per set of rows with equal columns 8 and 9) written to the first result file, we group again on just columns 8 and 9. Thanks to the sort, each of these groups is in ascending order of column 2, so the last row is written to result1 and the rest to result2.txt.
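
To check the logic without touching any files, the same sort-then-group steps can be run on the sample rows in memory. This is only a sketch for Python 3: the names SAMPLE and split_duplicates are made up for the illustration, the rows are trimmed to columns 0 to 9, and column 2 is converted to float for sorting.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question, trimmed to columns 0-9 for brevity.
SAMPLE = """\
con13 tr|M0VCZ1| 91.39 267 23 0 131 211 1 267
con13 tr|M8B287| 97.12 590 17 0 344 211 1 267
con15 tr|M0WV77| 92.57 148 11 0 73 516 1 148
con15 tr|C5WNQ0| 85.14 148 22 0 73 516 1 178
con15 tr|B8AQC2| 83.78 148 24 0 73 516 1 148
con18 tr|G9HXG9| 99.66 293 1 0 144 102 1 293
con18 tr|M0XCZ0| 98.29 293 5 0 144 102 1 293
"""


def split_duplicates(rows):
    """Hypothetical helper: return (kept, dropped) lists of rows."""
    kept, dropped = [], []
    # The input is already ordered by column 0, as groupby requires.
    for _, group in groupby(rows, key=itemgetter(0)):
        ordered = sorted(group, key=lambda r: (r[8], r[9], float(r[2])))
        for _, dupes in groupby(ordered, key=itemgetter(8, 9)):
            dupes = list(dupes)
            kept.append(dupes[-1])      # highest column 2 in this set
            dropped.extend(dupes[:-1])  # everything below it
    return kept, dropped


rows = [line.split() for line in SAMPLE.splitlines()]
kept, dropped = split_duplicates(rows)
print([r[1] for r in kept])
# ['tr|M8B287|', 'tr|M0WV77|', 'tr|C5WNQ0|', 'tr|G9HXG9|']
```

The kept rows match the output file requested in the question; the three dropped rows are the ones that would go to result2.txt.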

Licensed under: CC-BY-SA with attribution