If I understand you correctly, you want to send all lines for your input file that have a first element txt1
to your first output file if the second element txt2
of all those lines is the same; otherwise all those lines go to the second output file. Here is a program that does that.
from collections import defaultdict
# Read in file line-by-line for the first time
# Build up dictionary of txt1 to set of txt2 s
txt1totxt2 = defaultdict(set)
f=open('file.txt','r')
for line in f:
lst = line.split()
txt1=lst[0]
txt2=lst[1]
txt1totxt2[txt1].add(txt2);
# The dictionary tells us whether the second text
# is unique or not. If it's unique the set has
# just one element; otherwise the set has > 1 elts.
# Read in file for second time, sending each line
# to the appropriate output file
f.seek(0)
oh1=open('result1.txt','w')
oh2=open('result2.txt','w')
for line in f:
lst = line.split()
txt1=lst[0]
if len(txt1totxt2[txt1]) == 1:
oh1.write(line)
else:
oh2.write(line)
The program logic is very simple. For each txt
it builds up a set
of txt2
s that it sees. When you're done reading the file, if the set has just one element, then you know that the txt2
s are unique; if the set has more than one element, then there are at least two txt2
s. Note that this means that if you only have one line in the input file with a particular txt1
, it will always be sent to the first output file. There are ways round this if this is not the behaviour you want.
Note also that because the file is large, I've read it in line-by-line: lines=f.readlines()
in your original program reads the whole file into memory at a time. I've stepped through it twice: the second time does the output. If this increases the run time then you can restore the lines=f.readlines()
instead of reading it a second time. However the program as is should be much more robust to very large files. Conversely if your files are very large indeed, it would be worth looking at the program to reduce the memory usage further (the dictionary txt1totxt2
could be replaced with something more optimal, albeit more complicated, if necessary).
Edit: there was a good point in comments (now deleted) about the memory cost of this algorithm. To elaborate, the memory usage could be high, but on the other hand it isn't as severe as storing the whole file: rather txt1totxt2
is a dictionary from the first text in each line to a set of the second text, which is of the order of (size of unique first text) * (average size of unique second text for each unique first text). This is likely to be a lot smaller than the file size, but the approach may require further optimization. The approach here is to get something simple going first -- this can then be iterated to optimize further if necessary.