Fastest way to find duplicate lines in 50GB+ text file

https://stackoverflow.com/questions/20449489

30-08-2022
|

Question

I'm trying to check to make sure an encryption algorithm I wrote is one-to-one. To do so, I looped through the program and wrote all of the outputs (2^32 of them) to a file, one per line. The file is sitting at just over 50GB, after about 9 hours of runtime.

Now I need to run through all of the output lines to verify that there are no duplicates. Here is some sample output:

PAAA#0+//V8//
PAAA#o+//37//
PAAA#Q+//Z7//
ZAAA#d///#
ZAAA#J///#
ZAAA#/+//#

The easiest way I know would be to compare each line with all of the lines following it, but that would be theta(n!) and I don't really think I can stand to wait that long, considering n is 2^32.

Is there a way I can do this kind of comparison in O(n) or O(n log n) time? I'm not opposed to re-outputting it to a database if that would go faster - I was just trying to save disk space at this point.

I wrote the program in c++ on Win7, but I'm not opposed to using other languages on other OSes if it can be done much faster.

Thanks in advance for the help, guys!

Solution

Why don't you run a quicksort on the entire file, then if you only need a yes/no on whether a duplicate exists, you can check each string against the string just before/after it. In fact, if you write the quicksort yourself, you can have it check for duplicates as it sorts.

Alternatively, you could just bucketsort based on the first character of the string, then use multithreading and compare the strings in each bucket (strings in different buckets would never match--they start with different chars).

You could even bucketsort the buckets based on the second character... then bucket sort those buckets based on the third, etc., all the way down. You're end point is reached either when all the buckets have just 1 string in them (no duplicates) or when your buckets with multiple strings contain strings that are shorter then the number of levels deep you are (in which case, you have a duplicate). Again, use multithreading for more speed.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow