@senderle has a great answer, but since he mentioned that my solution will produce false positives, I figured the gauntlet had been laid and I'd better show some code. I thinned down your md5 function (it should always use the 'fileSliceLimitation' case and should be less stingy with its input buffer), then prefiltered by size before doing the md5s.
    import sys
    import os
    import hashlib
    from collections import defaultdict

    searchdirpath = sys.argv[1]
    size_map = defaultdict(list)

    def getFileHashMD5(filename):
        m = hashlib.md5()
        with open(filename, 'rb', 1024*1024) as fh:
            while True:
                data = fh.read(1024*1024)
                if not data:
                    break
                m.update(data)
        return m.hexdigest()

    # group files by size
    for dirname, dirnames, filenames in os.walk(searchdirpath):
        for filename in filenames:
            fullname = os.path.join(dirname, filename)
            size_map[os.stat(fullname).st_size].append(fullname)

    # hash only files that share their size with at least one other file
    for fullnames in size_map.itervalues():
        if len(fullnames) > 1:
            hash_map = defaultdict(list)
            for fullname in fullnames:
                hash_map[getFileHashMD5(fullname)].append(fullname)
            for fullnames in hash_map.itervalues():
                if len(fullnames) > 1:
                    print "duplicates:"
                    for fullname in fullnames:
                        print " ", fullname
(EDIT)
There were several questions about this implementation that I will try to answer here:
1) why (1024*1024) size not '5000000'
Your original code read in 8192-byte (8 KiB) increments, which is very small for modern systems. You will likely get better performance by reading more at once. 1024*1024 is 1048576 bytes (1 MiB) and was just a guess at a reasonable number. As for why I wrote it in such a strange way: 1000 (a decimal kilobyte) is loved by people, but 1024 (a binary kibibyte) is loved by computers and file systems. I am in the habit of writing some_number*1024 so it's easy to see that I'm referring to 1 KiB increments. 5000000 is a reasonable number too, but you should consider 5*1024*1024 (that is, 5 MiB) so that you get something that is nicely aligned for the file system.
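If it helps to see the arithmetic spelled out, here is a tiny illustration (the constant names are just for this example, not part of the script):

    KIB = 1024          # one binary kilobyte (kibibyte)
    MIB = 1024 * 1024   # one binary megabyte (mebibyte)

    eight_kib = 8 * KIB   # the original read size: 8192 bytes
    one_mib = MIB         # the read size used above: 1048576 bytes
    five_mib = 5 * MIB    # a file-system-aligned alternative to 5000000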
2) what does this bit do exactly: size_map = defaultdict(list)
It creates a 'defaultdict', which adds functionality to a regular dict object. A regular dict raises a KeyError exception when it is indexed by a non-existent key. defaultdict instead creates a default value and adds that key/value pair to the dict. In our case, size_map[some_size] says "give me the list of files of some_size, and create a new empty list if you don't have one".
    size_map[os.stat(fullname).st_size].append(fullname)

This breaks down to:
    stat = os.stat(fullname)
    size = stat.st_size
    filelist = size_map[size]   # this is the same as:
                                #   if size not in size_map:
                                #       size_map[size] = list()
                                #   filelist = size_map[size]
    filelist.append(fullname)
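Here is a minimal standalone demonstration of the difference (the key 4096 and the file paths are made up for the example):

    from collections import defaultdict

    plain = {}
    try:
        plain[4096].append('/tmp/a.txt')   # KeyError: 4096 was never added
    except KeyError:
        pass

    size_map = defaultdict(list)
    size_map[4096].append('/tmp/a.txt')    # creates the empty list on first access
    size_map[4096].append('/tmp/b.txt')    # reuses the same list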
3) sys.argv[1]: I'm guessing sys.argv[1] just makes the python py.py 'filepath' invocation work (where filepath is argv[1])?
Yes. When you run a Python script, sys.argv[0] is the name of the script and sys.argv[1:] (argument 1 and following) are any additional arguments given on the command line. I used sys.argv[1] as a quick way to test the script when I wrote it; you should change that to meet your needs.
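For example, a slightly more defensive version of the argument handling might look like this (the fallback to the current directory is my own choice, not part of the original script):

    import sys

    # sys.argv[0] is the script itself; user arguments start at index 1
    if len(sys.argv) > 1:
        searchdirpath = sys.argv[1]
    else:
        searchdirpath = '.'   # no argument given: search the current directory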