Question

I have a csv file of this format

testname unitname time data
test1 1 20131211220159 123123
test1 1 20131211220159 12345
test1 1 20131211230180 1234 

I am trying to remove all old data from this file and retain only the data with the latest timestamp (the first two rows above should be deleted because the last timestamp is greater than the first two). I want to keep all test data unless the same test and same unit was repeated at a later time. The input file is sorted by time in ascending order, so newer data is further down.

The file is about 15 MB (output_temp.csv). I copied it as output_temp2.csv.

This is what I have:

file1 = open("output_temp.csv", "r")
file2 = open("output_temp2.csv", "r")
file3 = open("output.csv", "w")

flag = 0
linecounter = 0

for line in file1:
    fields = line.split(",")
    testname = fields[0]
    vid = fields[1]
    tstamp = fields[2]
    file2.seek(0)  # reset to the top of the copy
    for i in range(linecounter):
        file2.readline()  # skip down to the current line number
    for line2 in file2:
        if testname == line2.split(",")[0] and vid == line2.split(",")[1] and tstamp != line2.split(",")[2]:
            flag = 1
            print line
        if flag == 1:
            break

    if flag == 0:
        file3.write(line)
    linecounter = linecounter + 1  # only ever move forward, never back up
    flag = 0

This is taking really long to process; I think it might be OK, but it's literally taking 10 minutes per 100 KB and I have a long way to go.

Was it helpful?

Solution

The main reason this is slow is that you're reading the entire file (or, rather, a duplicate copy of it) for each line in the file. So, if there are 10,000 lines, you're reading 10,000 lines 10,000 times, meaning 100,000,000 total line reads!

If you have enough memory to save the lines read so far, there's a really easy solution: Store the lines seen so far in a set. (Or, rather, for each line, store the tuple of the three keys that count for being a duplicate.) For each line, if it's already in the set, skip it; otherwise, process it and add it to the set.

For example:

seen = set()
for line in infile:
    testname, vid, tstamp = line.split(",", 3)[:3]
    if (testname, vid, tstamp) in seen:
        continue
    seen.add((testname, vid, tstamp))
    outfile.write(line)

The itertools recipes in the docs have a function unique_everseen that lets you wrap this up even more nicely:

def keyfunc(line):
    return tuple(line.split(",", 3)[:3])
for line in unique_everseen(infile, key=keyfunc):
    outfile.write(line)
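
unique_everseen isn't something you can import from itertools itself; it's a recipe you copy out of the documentation. A close paraphrase of that recipe, in Python 3 spelling (Python 2 uses itertools.ifilterfalse instead of filterfalse):

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element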

If the set takes too much memory, you can always fake a set on top of a dict, and you can fake a dict on top of a database by using the dbm module, which will do a pretty good job of keeping enough in memory to make things fast but not enough to cause a problem. The only problem is that dbm keys have to be strings, not tuples of three strings… but you can always just keep them joined up (or re-join them) instead of splitting, and then you've got a string.
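
As a minimal sketch of that dbm idea, assuming Python 3's dbm module and a throwaway seen.db database (the database name and file names are just placeholders), with the three key fields re-joined into a single string key:

import dbm

with dbm.open("seen.db", "n") as seen, \
        open("output_temp.csv") as infile, \
        open("output.csv", "w") as outfile:
    for line in infile:
        key = ",".join(line.split(",", 3)[:3])  # the three key fields as one string
        if key in seen:
            continue
        seen[key] = ""  # dbm only stores strings/bytes; the value doesn't matter
        outfile.write(line)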


I'm assuming that when you say the file is "sorted", you mean in terms of the timestamp, not in terms of the key columns. That is, there's no guarantee that two rows that are duplicates will be right next to each other. If there were, this would be even easier. It may not look any simpler if you use the itertools recipes, because you're just replacing everseen with justseen:

def keyfunc(line):
    return tuple(line.split(",", 3)[:3])
for line in unique_justseen(infile, key=keyfunc):
    outfile.write(line)

But under the covers, this is only keeping track of the last line, rather than a set of all lines. Which is not only faster, it also saves a lot of memory.
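
Like unique_everseen, unique_justseen is a recipe rather than an importable itertools function; a close paraphrase of the docs version (Python 3 spelling):

from itertools import groupby
from operator import itemgetter

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    return map(next, map(itemgetter(1), groupby(iterable, key)))

groupby only ever needs to remember the current key, and map(next, ...) keeps just the first line of each consecutive run, which is where the memory saving comes from.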


Now that (I think) I understand your requirements better, what you actually want to get rid of is not every line after the first with the same testname, vid, and tstamp, but rather every line with the same testname and vid except the one with the highest tstamp. And since the file is sorted in ascending order of tstamp, that means you can ignore the tstamp entirely; you just want the last match for each.

This means the everseen trick won't work—we can't skip the first one, because we don't yet know there's a later one.

If we just iterated the file backward, that would solve the problem. It would also double your memory usage (because, in addition to the set, you're also keeping a list so you can stack up all of those lines in reverse order). But if that's acceptable, it's easy:

def keyfunc(line):
    return tuple(line.split(",", 2)[:2])  # key on testname and vid only; tstamp no longer matters

# Scan newest-to-oldest so the first line kept for each key is the newest,
# then reverse again to restore the original order.
for line in reversed(list(unique_everseen(reversed(list(infile)), key=keyfunc))):
    outfile.write(line)

If turning those lazy iterators into lists so we can reverse them takes too much memory, it's probably fastest to do multiple passes: reverse the file on disk, then filter the reversed file, then reverse it again. It does mean two extra file writes, but that can be a lot better than, say, your OS's virtual memory swapping to and from disk hundreds of times (or your program just failing with a MemoryError).

If you're willing to do the work, it wouldn't be that hard to write a reverse file iterator, which reads buffers from the end and splits on newlines and yields the same way the file/io.Whatever object does. But I wouldn't bother unless you turn out to need it.
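
A minimal sketch of such an iterator, which reads fixed-size chunks from the end of the file and yields complete lines in reverse order (it glosses over edge cases like blank lines and a missing trailing newline):

import os

def reverse_lines(path, bufsize=1 << 16):
    # Read fixed-size buffers from the end of the file and yield complete
    # lines in reverse order. The first piece of each buffer may be a
    # partial line, so it's carried over and glued onto the next buffer.
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            size = min(bufsize, pos)
            pos -= size
            f.seek(pos)
            chunk = f.read(size) + tail
            lines = chunk.split(b"\n")
            tail = lines[0]  # possibly a partial line; carry it into the next chunk
            for line in reversed(lines[1:]):
                if line:
                    yield line.decode() + "\n"
        if tail:
            yield tail.decode() + "\n"

You could then feed reverse_lines("output_temp.csv") straight into unique_everseen and skip the list() calls entirely.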


If you ever do need to repeatedly read particular line numbers out of a file, the linecache module will usually speed things up a lot. Nowhere near as fast as not re-reading at all, of course, but a lot better than reading and parsing thousands of newlines.
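
For example (the line number here is made up; linecache uses 1-based numbering and caches the file's contents after the first call):

import linecache

# Fetch an arbitrary line of the copy without re-scanning everything
# before it on every lookup.
line2 = linecache.getline("output_temp2.csv", 1234)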

You're also wasting time repeating some work in the inner loop. For example, you call line2.split(",") three times, instead of just splitting it once and stashing the value in a variable, which would be three times as fast. A 3x constant gain is nowhere near as important as a quadratic to linear gain, but when it comes for free by making your code simpler and more readable, might as well take it.
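
Concretely, the body of your inner loop could split once and reuse the pieces, something like:

parts = line2.split(",", 3)  # split once, reuse the pieces
if testname == parts[0] and vid == parts[1] and tstamp != parts[2]:
    flag = 1
    print line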

Other tips

For a file of this size (~15 MB), Pandas would be an excellent choice. Like this:

import pandas as pd
raw_data = pd.read_csv('/path/to/output_temp.csv')
clean_data = raw_data.drop_duplicates()
clean_data.to_csv('/path/to/clean_csv.csv')

I was able to process a CSV file of about 151 MB, containing more than 5.9 million rows, in less than a second with the above snippet. Note that the duplicate check can be made conditional or restricted to a subset of fields; Pandas provides a lot of these features out of the box (see the drop_duplicates documentation).
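
As a rough sketch of how that applies here, assuming a reasonably recent pandas and that the column names match the header shown in the question, keeping only the newest row per test/unit pair could look like:

import pandas as pd

raw_data = pd.read_csv('/path/to/output_temp.csv')
# The file is sorted by time ascending, so keep='last' retains the newest
# row for each (testname, unitname) pair.
clean_data = raw_data.drop_duplicates(subset=['testname', 'unitname'], keep='last')
clean_data.to_csv('/path/to/clean_csv.csv', index=False)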

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow