Question

I want to compare a number of files and find out which of them are the same, but they are not necessarily text files (so please don't suggest diff).

The files can be in any format (ie binary files).

I found out that I can run md5sum to get the hash of each file and then compare the hashes manually to check whether the files are the same. But how can I automate this process?

PS: I also found that I can store the md5sums in a file using

md5sum <file-names> > md5sum.txt

but I am stuck on how to automate this process.

I would prefer this to be done via a script (any language is fine).


Solution

If you can use a language like Perl or Python, which have built-in support for hashes/dictionaries, it's really easy.

Loop over the file names, compute the signature of each file, and build a hash with the md5sum as key and the list of files with that md5 as value.

Then loop over the contents of the hash and show the entries with more than one item. These files are likely to be identical (you can't be completely sure with a signature-based approach, since two different files can in principle share the same hash).

As people are asking for code, maybe something like below. That is a perl implementation. I may add an equivalent python sample later if it is wanted.

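#!perl
use strict;
use warnings;

my %same;    # md5 signature => list of file names with that signature
for my $x (@ARGV){
    # List-form pipe open bypasses the shell, so names with spaces are safe.
    open(my $md5, '-|', 'md5sum', '--', $x) or die "Cannot run md5sum: $!";
    my $line = <$md5>;
    close($md5);
    next unless defined $line;    # md5sum failed (e.g. unreadable file)
    my ($sig) = split(/\s+/, $line);
    push @{$same{$sig}}, $x;
}

for my $sig (keys %same){
    if (@{$same{$sig}} > 1){
        print "Files with same MD5 : ".join('-', @{$same{$sig}})."\n";
    }
}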

Say you put that in a file same.pl, you call it like:

perl same.pl <file-names>

Example of use:

$ md5sum F*
c9904273735f3141c1dd61533e02246a  F1
c9904273735f3141c1dd61533e02246a  F2
c9904273735f3141c1dd61533e02246a  F3
d41d8cd98f00b204e9800998ecf8427e  F4

$ perl same.pl F1 F2 F3 F4
Files with same MD5 : F1-F2-F3

Below is a possible Python version (working with both Python 2 and Python 3).

#!python

import hashlib

def md5sum(filename):
    # Hash the file in fixed-size chunks so large files never sit fully in memory.
    d = hashlib.md5()
    with open(filename, mode='rb') as f:  # 'with' closes the file even on error
        buf = f.read(65536)
        while buf:
            d.update(buf)
            buf = f.read(65536)
    return d.hexdigest()


if __name__ == "__main__":
    import sys
    same = {}  # md5 signature -> list of file names with that signature
    for name in sys.argv[1:]:
        sig = md5sum(name)
        same.setdefault(sig, []).append(name)

    for k in same:
        if len(same[k]) > 1:
            print("Files with same MD5: {l}".format(l="-".join(same[k])))
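Since a matching signature only means the files are likely identical, a group of files sharing one MD5 can be double-checked byte by byte if certainty matters. The following is only a sketch of one way to do that with the standard filecmp module; the function name confirm_identical is made up for this illustration and is not part of the scripts above.

```python
import filecmp


def confirm_identical(names):
    """Split a list of files (e.g. one same-MD5 group) into sub-groups
    whose contents are really byte-for-byte identical."""
    groups = []
    for name in names:
        for group in groups:
            # shallow=False makes filecmp compare the file contents,
            # not just the os.stat() information.
            if filecmp.cmp(group[0], name, shallow=False):
                group.append(name)
                break
        else:
            # No existing group matched: this file starts a new group.
            groups.append([name])
    # Only groups with at least two files are confirmed duplicates.
    return [g for g in groups if len(g) > 1]
```

It could be called on each list of more than one name stored in the same dictionary, printing only the groups it returns.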

Note that if you are comparing a really large number of files, passing the file names on the command line as in the examples above may not be enough, because the shell command line will overflow. In that case you should use a more elaborate way of supplying the names (or put some glob inside the script).
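One way to avoid the command-line limit is to let the script find the files itself. Below is a rough sketch, assuming all the files live under a single directory tree; the function name find_duplicates and the size pre-filter (hashing only files whose size is shared with another file) are additions for this illustration, not something the scripts above do.

```python
import hashlib
import os


def md5sum(filename, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks."""
    d = hashlib.md5()
    with open(filename, "rb") as f:
        for buf in iter(lambda: f.read(chunk_size), b""):
            d.update(buf)
    return d.hexdigest()


def find_duplicates(root):
    """Walk root and return lists of paths that share the same MD5."""
    # First pass: group regular files by size (cheap, no file reads).
    by_size = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size.setdefault(os.path.getsize(path), []).append(path)

    # Second pass: hash only files whose size is not unique.
    same = {}
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            same.setdefault(md5sum(path), []).append(path)

    return [sorted(group) for group in same.values() if len(group) > 1]
```

Calling find_duplicates(".") and printing each returned list with "-".join(...) would give output in the same style as the scripts above, without ever passing file names through the shell.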

Licensed under: CC-BY-SA with attribution