Question

How to get complementary lines from two text files?

File file1.txt has

123 foo
234 bar
...

File file2.txt has

123 foo
333 foobar
234 bar
...

I want to get all lines that are in file1.txt but not in file2.txt. The files are each hundreds of MB in size and contain non-ASCII characters. What's a fast way to do this?

Solution

Lines, specifically?

fgrep -vxf file2.txt file1.txt
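
A quick self-contained demo of that command (the sample data here is made up for illustration). One extra hint for large non-ASCII files: forcing the C locale with `LC_ALL=C` often speeds grep up considerably, since it skips locale-aware matching.

```shell
# Sample inputs (stand-ins for the real, large files)
printf '123 foo\n234 bar\n999 baz\n' > file1.txt
printf '123 foo\n333 foobar\n234 bar\n' > file2.txt

# -f file2.txt: use each line of file2.txt as a pattern
# -x: match whole lines only    -v: print the NON-matching lines
# fgrep is the old spelling of "grep -F" (fixed-string matching)
LC_ALL=C fgrep -vxf file2.txt file1.txt
# -> 999 baz
```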

OTHER TIPS

For good performance with large files, don't read much of the file into memory; work with what's on disk as much as possible.

String-matching can be done efficiently with hashing.

One strategy:

  1. Scan file2.txt line by line. For each line:
    • Hash the string for the line. The hashing algorithm you use does matter; djb2 is one example but there are many.
    • Put the key into a hash-set structure. Do not keep the string data.
  2. Scan file1.txt line by line. For each line:
    • Hash the string for the line.
    • If the hash key is not found in the set built from file2.txt:
      • Write the string data for this line to the output where you're tracking the different lines (e.g. standard output or another file). No line of file2.txt produced this hash, so the line appears in file1.txt but not in file2.txt.

One caveat: if you store only the hashes, a hash collision can make a line of file1.txt look like it's in file2.txt when it isn't. If that matters, keep the strings (or compare them on a hash match) instead of discarding them.
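
The scheme above maps directly onto awk's associative arrays. A minimal sketch (keying the array by the whole line rather than a separate hash value, which also sidesteps the collision issue; the sample data is made up for illustration):

```shell
# Sample inputs (stand-ins for the real files)
printf '123 foo\n234 bar\n999 baz\n' > file1.txt
printf '123 foo\n333 foobar\n234 bar\n' > file2.txt

# NR==FNR is true only while reading the first argument (file2.txt):
# record each of its lines as a key in the associative array "seen".
# For file1.txt, print only the lines that are not keys of "seen".
awk 'NR==FNR { seen[$0]; next } !($0 in seen)' file2.txt file1.txt
# -> 999 baz
```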

"Hundreds of MB" is not so much.

I would solve this task this way (in Perl):

$ cat complementary.pl 
my %f;

# Read file2 (second argument) into a hash keyed by whole lines
open(F, "$ARGV[1]") or die "Can't open file2: $ARGV[1]\n";
$f{$_} = 1 while (<F>);
close(F);

# Print each line of file1 (first argument) that was not seen in file2
open(F, "$ARGV[0]") or die "Can't open file1: $ARGV[0]\n";
while (<F>) {
    print if not exists $f{$_};
}
close(F);

Example of usage:

$ cat file1.txt 
100 a
200 b
300 c

$ cat file2.txt 
200 b
100 a
400 d

$ perl complementary.pl file1.txt file2.txt 
300 c
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow