The main reason this is slow is that you're reading the entire file (or, rather, a duplicate copy of it) for each line in the file. So, if there are 10,000 lines, you're reading 10,000 lines 10,000 times, meaning 100,000,000 total line reads!
If you have enough memory to save the lines read so far, there's a really easy solution: Store the lines seen so far in a set. (Or, rather, for each line, store the tuple of the three keys that count for being a duplicate.) For each line, if it's already in the set, skip it; otherwise, process it and add it to the set.
For example:
seen = set()
for line in infile:
    # The first three comma-separated fields decide whether a line is a duplicate.
    testname, vid, tstamp = line.split(",", 3)[:3]
    if (testname, vid, tstamp) in seen:
        continue
    seen.add((testname, vid, tstamp))
    outfile.write(line)
The itertools recipes in the docs have a function unique_everseen that lets you wrap this up even more nicely:
def keyfunc(line):
    return tuple(line.split(",", 3)[:3])

for line in unique_everseen(infile, key=keyfunc):
    outfile.write(line)
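Note that unique_everseen is a recipe in the itertools documentation, not something you can import from itertools itself. You either paste it into your code or install the third-party more_itertools package, which ships it. A minimal version, closely following the documented recipe, looks roughly like this:

```python
from itertools import filterfalse


def unique_everseen(iterable, key=None):
    """Yield each element whose key has not been seen before, keeping order."""
    seen = set()
    if key is None:
        # No key function: the element itself is what we remember.
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element
```

It is a generator, so it still reads the input file lazily, one line at a time; only the set of keys grows.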
If the set takes too much memory, you can always fake a set on top of a dict, and you can fake a dict on top of a database by using the dbm module, which will do a pretty good job of keeping enough in memory to make things fast but not enough to cause a problem. The only problem is that dbm keys have to be strings, not tuples of three strings… but you can always just keep them joined up (or re-join them) instead of splitting, and then you've got a string.
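Here is a rough sketch of that dbm idea, assuming Python 3. The function name dedupe_with_dbm and the db_path argument are made up for illustration. The key is the first three fields re-joined into one string and encoded to bytes, since dbm stores keys and values as bytes; the stored value is a throwaway placeholder.

```python
import dbm


def dedupe_with_dbm(lines, db_path):
    """Yield lines whose first three fields have not been seen, using an
    on-disk dbm database in place of an in-memory set.

    db_path is a hypothetical scratch location; flag "n" always creates a
    new, empty database there.
    """
    with dbm.open(db_path, "n") as seen:
        for line in lines:
            # Re-join the three key fields into a single bytes key.
            key = ",".join(line.split(",", 3)[:3]).encode("utf-8")
            if key in seen:
                continue
            seen[key] = b"1"  # the value is irrelevant; only the key matters
            yield line
```

Usage would be along the lines of `for line in dedupe_with_dbm(infile, "seen.db"): outfile.write(line)`. Expect it to be noticeably slower than a plain set, so it only makes sense when the set really does not fit in memory.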
I'm assuming that when you say the file is "sorted", you mean in terms of the timestamp, not in terms of the key columns. That is, there's no guarantee that two rows that are duplicates will be right next to each other. If there were, this would be even easier. It may not look any easier if you use the itertools recipes, because you're just replacing everseen with justseen:
def keyfunc(line):
    return tuple(line.split(",", 3)[:3])

for line in unique_justseen(infile, key=keyfunc):
    outfile.write(line)
But under the covers, this is only keeping track of the last line, rather than a set of all lines. Which is not only faster, it also saves a lot of memory.
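To see why, here is roughly what unique_justseen boils down to. Like unique_everseen, it is a recipe from the itertools docs (also available in more_itertools), not a built-in. This simplified version is built on itertools.groupby, which only ever compares each key with the previous one:

```python
from itertools import groupby


def unique_justseen(iterable, key=None):
    """Yield the first element of each run of consecutive equal keys."""
    for _, group in groupby(iterable, key):
        # groupby starts a new group whenever the key changes, so the first
        # item of each group is the one we keep.
        yield next(group)
```

No set is involved at all, which is why this only works when duplicates are guaranteed to be adjacent.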
Now that (I think) I understand your requirements better, what you actually want to get rid of is not all but the first line with the same testname, vid, and tstamp, but rather all lines with the same testname and vid except the one with the highest tstamp. And since the file is sorted in ascending order of tstamp, that means you can ignore the tstamp entirely; you just want the last match for each.
This means the everseen trick won't work: we can't skip the first one, because we don't yet know there's a later one.
If we just iterated the file backward, that would solve the problem. It would also double your memory usage (because, in addition to the set, you're also keeping a list so you can stack up all of those lines in reverse order). But if that's acceptable, it's easy:
def keyfunc(line):
    # Only testname and vid matter now; tstamp is ignored.
    return tuple(line.split(",", 2)[:2])

for line in reversed(list(unique_everseen(reversed(list(infile)), key=keyfunc))):
    outfile.write(line)
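If the nested reversed calls are hard to follow, the same idea can be spelled out as an explicit loop. This is just a sketch with made-up sample rows; keep_last is a hypothetical helper name, and the rows only assume the testname,vid,tstamp,... layout described above.

```python
def keep_last(lines, key):
    """Keep only the last line for each key, preserving the original order."""
    seen = set()
    kept = []
    # Walk backward so the first line we meet for a key is the latest one.
    for line in reversed(list(lines)):
        k = key(line)
        if k not in seen:
            seen.add(k)
            kept.append(line)
    kept.reverse()  # restore ascending order
    return kept


def keyfunc(line):
    return tuple(line.split(",", 2)[:2])


rows = [
    "t1,v1,100,a\n",
    "t2,v1,110,b\n",
    "t1,v1,120,c\n",
    "t2,v2,130,d\n",
]
result = keep_last(rows, keyfunc)
# result == ["t2,v1,110,b\n", "t1,v1,120,c\n", "t2,v2,130,d\n"]
```

The older t1,v1 row (timestamp 100) is dropped in favor of the later one (timestamp 120), and the surviving rows stay in timestamp order.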
If turning those lazy iterators into lists so we can reverse them takes too much memory, it's probably fastest to do multiple passes: reverse the file on disk, then filter the reversed file, then reverse it again. It does mean two extra file writes, but that can be a lot better than, say, your OS's virtual memory swapping to and from disk hundreds of times (or your program just failing with a MemoryError).
If you're willing to do the work, it wouldn't be that hard to write a reverse file iterator, which reads buffers from the end and splits on newlines and yields the same way the file/io.Whatever object does. But I wouldn't bother unless you turn out to need it.
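In case you do turn out to need it, here is one possible sketch of such an iterator. The name reverse_lines and the bufsize default are arbitrary. It works on bytes, and it relies on bytes.splitlines, which also treats a lone \r as a line break, so it assumes ordinary \n or \r\n line endings.

```python
import os


def reverse_lines(path, bufsize=8192):
    """Yield the lines of a file (as bytes, newlines kept) from last to first."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""  # possibly incomplete line carried over from the later chunk
        while pos > 0:
            step = min(bufsize, pos)
            pos -= step
            f.seek(pos)
            chunk = f.read(step) + tail
            lines = chunk.splitlines(keepends=True)
            # The first piece may start in an earlier chunk, so hold it back.
            tail = lines[0]
            for line in reversed(lines[1:]):
                yield line
        if tail:
            yield tail  # the very first line of the file
```

Memory use stays around one buffer plus the longest line, rather than the whole file.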
If you ever do need to repeatedly read particular line numbers out of a file, the linecache module will usually speed things up a lot. Nowhere near as fast as not re-reading at all, of course, but a lot better than reading and parsing thousands of newlines.
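For reference, a minimal sketch of linecache usage; read_line is just an illustrative wrapper name. Line numbers are 1-based, and getline returns an empty string rather than raising if the line or the file doesn't exist.

```python
import linecache


def read_line(path, lineno):
    """Return line number `lineno` (1-based) of the file at `path`.

    linecache reads the file once and keeps its lines in memory, so repeated
    lookups do not touch the disk again. Returns "" for a missing line.
    """
    return linecache.getline(path, lineno)
```

Keep in mind that this caches the whole file in memory, so it trades memory for speed, and you would call linecache.clearcache() once you no longer need the lines.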
You're also wasting time repeating some work in the inner loop. For example, you call line2.split(",") three times, instead of just splitting it once and stashing the value in a variable, which would make that step roughly three times as fast. A 3x constant gain is nowhere near as important as a quadratic to linear gain, but when it comes for free by making your code simpler and more readable, you might as well take it.
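Something like the following is what I mean by splitting once. Since your exact loop body may differ, treat this as a sketch: key_fields is a made-up helper name, and it assumes the three values you need are the first three comma-separated fields.

```python
def key_fields(line):
    """Split the line once and reuse the result for all three fields."""
    fields = line.split(",")  # one split instead of three
    return fields[0], fields[1], fields[2]
```

Inside the loop that becomes `testname, vid, tstamp = key_fields(line2)`, which is both shorter and cheaper than calling line2.split(",") for each field.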