Use the csv
module to split your lines for you and to write formatted data back out, and group on a tuple
of just the 9th and 10th columns to track similar rows:
import csv
from itertools import groupby
from operator import itemgetter

with open('example.txt', 'rb') as f1, \
        open('result1', 'wb') as f2, open('result2.txt', 'wb') as f3:
    reader = csv.reader(f1, delimiter='\t')
    writer1 = csv.writer(f2, delimiter='\t')
    writer2 = csv.writer(f3, delimiter='\t')

    for group, rows in groupby(reader, itemgetter(0)):
        rows = sorted(rows, key=itemgetter(8, 9, 2))
        for k, rows in groupby(rows, itemgetter(8, 9)):
            # now we are grouping on columns 8 and 9,
            # *and* these are sorted on column 2;
            # everything but the *last* row is written to writer2
            rows = list(rows)
            writer1.writerow(rows[-1])
            writer2.writerows(rows[:-1])
The sorted(rows, key=itemgetter(8, 9, 2))
call sorts each group of rows (all rows sharing the same row[0]
value) on columns 9, 10 and 3, in that order.
Because you then want to write only the row with the highest value in column 2, *per group of rows with columns 8 and 9 equal*, to the first result file, we group again: the rows are sorted on columns 8, 9 and 2 (in that order), then grouped on just columns 8 and 9, giving us groups in ascending order on column 2. The last row of each group is then written to result1
, the rest to result2.txt
.
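To see the double-groupby pattern in action on a smaller scale, here is a minimal Python 3 sketch (note the code above uses Python 2-style 'rb'/'wb' file modes). The 4-column sample data is made up for illustration: column 0 is the outer group key, column 1 the value to rank on, and columns 2-3 stand in for columns 8-9 of the real data:

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# made-up tab-separated sample: group key, value, two dedup columns
data = (
    "a\t1\tx\ty\n"
    "a\t3\tx\ty\n"
    "a\t2\tx\tz\n"
    "b\t5\tx\ty\n"
)

reader = csv.reader(io.StringIO(data), delimiter='\t')
kept, dropped = [], []
for group, rows in groupby(reader, itemgetter(0)):
    # sort each outer group on the dedup columns first, value last,
    # so each inner group ends with its highest value.
    # Note: csv yields strings, so values sort lexicographically;
    # wrap the key in int() if the real data needs numeric order.
    rows = sorted(rows, key=itemgetter(2, 3, 1))
    for k, dupes in groupby(rows, itemgetter(2, 3)):
        dupes = list(dupes)
        kept.append(dupes[-1])      # highest value per (col2, col3) pair
        dropped.extend(dupes[:-1])  # everything else

print(kept)     # rows that would go to result1
print(dropped)  # rows that would go to result2.txt
```

Within group "a", the two rows sharing (x, y) collapse to the one with the larger value, while the (x, z) row survives untouched; group "b" has no duplicates at all.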