Question

I am new to a lot of these C++ libraries, so please forgive me if my question comes across as naive.

I have two large text files, about 160 MB each (roughly 700,000 lines each). I need to remove from file2 all of the duplicate lines that appear in file1. To achieve this, I decided to use an unordered_map with a 32-character string as my key. The 32-character string is the first 32 characters of each line (this is enough to uniquely identify the line).

Anyway, I basically just go through the first file and push the 32-character substring of each line into the unordered_map. Then I go through the second file and check whether each line in file2 exists in my unordered_map. If it doesn't exist, I write the full line to a new text file.
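For reference, here is a minimal sketch of the first pass (simplified; the file names and identifiers are placeholders, not my exact code):

#include <fstream>
#include <string>
#include <unordered_map>

int main() {
    // Pass 1: store the 32-character prefix of every line in file1.
    std::unordered_map<std::string, int> lineMap;
    std::ifstream file1("file1.txt");
    std::string line;
    while (std::getline(file1, line))
        ++lineMap[line.substr(0, 32)];  // the int counts duplicates within file1
    // (Pass 2 then reads file2 and writes out each line whose prefix is absent.)
}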

This works fine for smaller files (40 MB each), but for these 160 MB files it takes very long to insert into the hash table (before I even start looking at file2). At around 260,000 inserts it seems to have stalled, or at least slowed to a crawl. Is it possible that I have reached my memory limitations? If so, can anybody explain how to calculate this? If not, is there something else I could be doing to make it faster? Maybe choosing a custom hash function, or specifying some parameters that would help optimize it?

My key/value pair in the hash table is (string, int), where the string is always 32 characters long and the int is a count I use to handle duplicates. I am running 64-bit Windows 7 with 12 GB of RAM.

Any help would be greatly appreciated. Thanks, guys!


Solution

You don't need a map, because you don't have any associated values: an unordered set will do the job. I would also go with a memory-efficient hash set implementation such as Google's sparse_hash_set; it is very compact and can even store its contents on disk.
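For example, here is a minimal sketch of the set-based version (the file names and the reserve() size are assumptions, not details from the question):

#include <fstream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> seen;
    seen.reserve(800000);  // pre-size the buckets for ~700k keys to avoid repeated rehashing

    std::string line;
    std::ifstream in1("file1.txt");
    while (std::getline(in1, line))
        seen.insert(line.substr(0, 32));

    std::ifstream in2("file2.txt");
    std::ofstream out("file2_filtered.txt");
    while (std::getline(in2, line))
        if (!seen.count(line.substr(0, 32)))
            out << line << '\n';
}

Dropping the unused int value and reserving the bucket count up front already reduces memory traffic and rehashing; for this usage pattern, swapping std::unordered_set for sparse_hash_set is then largely a drop-in change.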

Aside from that, you can work on smaller chunks of data. For example, split your files into 10 blocks, remove duplicates from each, and then combine them until you reach a single block with no duplicates. You get the idea; one possible implementation is sketched below.
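One concrete way to do that (a sketch, not from the original answer): partition both files by a hash of the 32-character prefix, so matching lines always land in the same pair of chunk files, then filter each pair with a small in-memory set. The file names, the partition helper, and the chunk count are all illustrative.

#include <fstream>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Split a file into n chunk files, routing each line by the hash of its
// 32-character prefix, so identical prefixes always end up in the same chunk.
static void partition(const std::string& in, const std::string& prefix, int n) {
    std::vector<std::ofstream> outs;
    for (int i = 0; i < n; ++i)
        outs.emplace_back(prefix + std::to_string(i));
    std::ifstream f(in);
    std::string line;
    std::hash<std::string> h;
    while (std::getline(f, line))
        outs[h(line.substr(0, 32)) % n] << line << '\n';
}

int main() {
    const int n = 10;
    partition("file1.txt", "a_", n);
    partition("file2.txt", "b_", n);

    std::ofstream out("file2_filtered.txt");
    for (int i = 0; i < n; ++i) {
        // Only chunk i of file1 can contain the prefixes present in chunk i of file2.
        std::unordered_set<std::string> seen;
        std::ifstream a("a_" + std::to_string(i));
        std::string line;
        while (std::getline(a, line))
            seen.insert(line.substr(0, 32));

        std::ifstream b("b_" + std::to_string(i));
        while (std::getline(b, line))
            if (!seen.count(line.substr(0, 32)))
                out << line << '\n';
    }
}

Note that file2's surviving lines come out grouped by chunk rather than in their original order; if order matters, you would need an extra pass to restore it.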

OTHER TIPS

I would not write a C++ program to do this, but would use existing utilities instead. On Linux, Unix, or Cygwin, do the following:

cat the two files into one large file:

# cat file1 file2 > file3

Use sort -u to extract the unique lines:

# sort -u file3 > file4

Prefer to use operating system utilities rather than (re)writing your own.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow