How to find set difference of two files?

https://stackoverflow.com/questions/10489473

06-06-2021

Question

I have two files A and B. I want to find all the lines in A that are not in B. What's the fastest way to do this in bash/using standard linux utilities? Here's what I tried so far:

for line in `cat file1`
 do
   # Count whole-line matches in file2; print the line if there are none.
   if [ `grep -c "^$line$" file2` -eq 0 ]; then
   echo "$line"
   fi
 done

It works, but it's slow. Is there a faster way of doing this?

Solution

The BashFAQ describes doing exactly this with comm, which is the canonically correct method.

# Subtraction of file1 from file2
# (i.e., only the lines unique to file2)
comm -13 <(sort file1) <(sort file2)
# For the reverse (only the lines unique to file1), use -23 instead of -13.

diff is less appropriate for this task, as it tries to operate on blocks rather than individual lines; as such, the algorithms it has to use are more complex and less memory-efficient.

comm has been part of the Single Unix Specification since SUS2 (1997).
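
As a sketch of how this applies to the question as asked (lines in A that are not in B), the columns to suppress are 2 and 3 rather than 1 and 3. The helper name set_difference and the temporary file are assumptions made for illustration. The temporary file stands in for process substitution so that the function should also run under a plain POSIX sh, not only bash.

```shell
# set_difference FILE_A FILE_B: print the lines of FILE_A that are not in FILE_B.
# (set_difference is a hypothetical helper name, not something from the answer.)
set_difference() {
  sorted_b=$(mktemp) || return 1
  # LC_ALL=C gives sort and comm the same byte-wise collation order.
  LC_ALL=C sort "$2" > "$sorted_b"
  # -2 drops lines found only in FILE_B, -3 drops lines common to both,
  # leaving column 1: lines found only in FILE_A. "-" reads FILE_A from stdin.
  LC_ALL=C sort "$1" | LC_ALL=C comm -23 - "$sorted_b"
  rm -f "$sorted_b"
}

# Sample data: A holds a, b, c, d and B holds a, b, e.
workdir=$(mktemp -d)
printf '%s\n' a b c d > "$workdir/A"
printf '%s\n' a b e   > "$workdir/B"

set_difference "$workdir/A" "$workdir/B"   # prints c and d, one per line
rm -rf "$workdir"
```

Note that the result comes out in sorted order, not in the original order of A.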

OTHER TIPS

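If you simply want the lines that are in file A but not in B, you can sort both files and compare them with diff. In unified output, lines present only in the first file start with "-", and the two-line file header ("---" and "+++") has to be skipped so that it is not mistaken for one of them.

sort A > A.sorted
sort B > B.sorted
diff -u A.sorted B.sorted | sed -n -e '1,2d' -e 's/^-//p'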

The 'diff' program is a standard Unix program that reports the differences between two files.

% cat A
a
b
c
d
% cat B
a
b
e
% diff A B
3,4c3
< c
< d
---
> e

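With a simple grep and cut one can select the lines that are in A but not in B. Lines found only in the first file are prefixed with "< ", so cut -c3- strips those two characters and leaves the rest of the line intact, spaces included. This relies on the shared lines appearing in the same order in both files; sort the files first if they do not.

% diff A B | grep '^<' | cut -c3-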
c
d
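
For comparison with the loop in the question, a single grep call can apply the same whole-line test without sorting either file. This is not taken from the answers above; it is a minimal sketch, and the helper name lines_not_in is hypothetical. It treats the lines of B as fixed strings, so characters that are special in regular expressions need no escaping.

```shell
# lines_not_in FILE_A FILE_B: print lines of FILE_A with no identical line in FILE_B.
# (lines_not_in is a hypothetical helper name.)
lines_not_in() {
  # -F  treat each line of FILE_B as a fixed string, not a regex
  # -x  require a match to cover the whole line
  # -v  invert: print the lines of FILE_A that match none of the patterns
  # -f  read the patterns from FILE_B
  grep -F -x -v -f "$2" "$1"
}

# Sample data: A holds a, b, c, d and B holds a, b, e.
workdir=$(mktemp -d)
printf '%s\n' a b c d > "$workdir/A"
printf '%s\n' a b e   > "$workdir/B"

lines_not_in "$workdir/A" "$workdir/B"   # prints c and d, one per line
rm -rf "$workdir"
```

Unlike the comm approach, this keeps the lines in the order they appear in A. grep exits with status 1 when it prints nothing, which matters if the script runs under set -e.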

Licensed under: CC-BY-SA with attribution
