Question

I know this question has been asked before, and I've seen some of the answers, but this question is more about my code and the best way of accomplishing this task.

I want to scan a directory and see if there are any duplicates (by checking MD5 hashes) in that directory. The following is my code:

import sys
import os
import hashlib

fileSliceLimitation = 5000000 #bytes

# if the file is big, use the slice trick to avoid loading the whole file into RAM
def getFileHashMD5(filename):
    retval = 0
    filesize = os.path.getsize(filename)

    if filesize > fileSliceLimitation:
        with open(filename, 'rb') as fh:
            m = hashlib.md5()
            while True:
                data = fh.read(8192)
                if not data:
                    break
                m.update(data)
            retval = m.hexdigest()

    else:
        retval = hashlib.md5(open(filename, 'rb').read()).hexdigest()

    return retval

searchdirpath = raw_input("Type directory you wish to search: ")
print ""
print ""    
text_file = open('outPut.txt', 'w')

for dirname, dirnames, filenames in os.walk(searchdirpath):
    # print path to all filenames.
    for filename in filenames:
        fullname = os.path.join(dirname, filename)
        h_md5 = getFileHashMD5 (fullname)
        print h_md5 + " " + fullname
        text_file.write("\n" + h_md5 + " " + fullname)   

# close txt file
text_file.close()


print "\n\n\nReading outPut:"
text_file = open('outPut.txt', 'r')

myListOfHashes = text_file.read()

if h_md5 in myListOfHashes:
    print 'Match: ' + " " + fullname

This gives me the following output:

Please type in directory you wish to search using above syntax: /Users/bubble/Desktop/aF

033808bb457f622b05096c2f7699857v /Users/bubble/Desktop/aF/.DS_Store
409d8c1727960fddb7c8b915a76ebd35 /Users/bubble/Desktop/aF/script copy.py
409d8c1727960fddb7c8b915a76ebd25 /Users/bubble/Desktop/aF/script.py
e9289295caefef66eaf3a4dffc4fe11c /Users/bubble/Desktop/aF/simpsons.mov

Reading outPut:
Match:  /Users/bubble/Desktop/aF/simpsons.mov

My idea was:

1) Scan directory
2) Write MD5 hashes + filename to text file
3) Open text file as read only
4) Scan directory AGAIN and check against text file...

I see that this isn't a good way of doing it AND it doesn't work. The 'match' just prints out the very last file that was processed.

How can I get this script to actually find duplicates? Can someone tell me a better/easier way of accomplishing this task?

Thank you very much for any help. Sorry this is a long post.


Solution

The obvious tool for identifying duplicates is a hash table. Unless you are working with a very large number of files, you could do something like this:

from collections import defaultdict

file_dict = defaultdict(list)
for filename in files:
    file_dict[get_file_hash(filename)].append(filename)
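Here, files can be any iterable of full paths; for example, the same os.walk traversal from your script will do. A small sketch (the helper name walk_files is just for illustration):

import os

def walk_files(root):
    """Yield the full path of every file under root, using os.walk."""
    for dirname, dirnames, filenames in os.walk(root):
        for filename in filenames:
            yield os.path.join(dirname, filename)

files = walk_files(searchdirpath)   # searchdirpath as in the original script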

At the end of this process, file_dict will contain a list for every unique hash; when two files have the same hash, they'll both appear in the list for that hash. Then filter the dict looking for value lists longer than 1, and compare the files to make sure they're the same -- something like this:

for duplicates in file_dict.values():   # file_dict.itervalues() in Python 2
    if len(duplicates) > 1:
        # double-check reported duplicates and generate output

Or this:

duplicates = [files for files in file_dict.values() if len(files) > 1]
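To actually double-check the candidates afterward, a byte-for-byte comparison is cheap once the groups are small. A minimal sketch using the standard filecmp module (the duplicate_groups name is just for illustration):

import filecmp

duplicate_groups = [files for files in file_dict.values() if len(files) > 1]

for group in duplicate_groups:
    first = group[0]
    for other in group[1:]:
        # shallow=False forces a content comparison instead of an os.stat() check
        if filecmp.cmp(first, other, shallow=False):
            print('confirmed duplicate: %s == %s' % (first, other))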

get_file_hash could use MD5s; or it could simply get the first and last bytes of the file, as Ramchandra Apte suggested in the comments above; or it could simply use file sizes, as tdelaney suggested in the comments above. Each of the latter two strategies is more likely to produce false positives, though. You could combine them to reduce the false positive rate.
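As a concrete illustration of the cheaper strategies, a get_file_hash that keys on size plus the first and last few kilobytes could look like this (a sketch; the chunk size is an arbitrary choice, and anything it groups together still needs a full-content check):

import os

def get_file_hash(filename, chunk=4096):
    """Cheap fingerprint: (size, first chunk, last chunk). Collisions are
    possible, so matches are only candidates, not confirmed duplicates."""
    size = os.path.getsize(filename)
    with open(filename, 'rb') as fh:
        head = fh.read(chunk)
        fh.seek(max(size - chunk, 0))
        tail = fh.read(chunk)
    # a tuple of immutable values is hashable, so it works directly as a dict key
    return (size, head, tail)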

If you're working with a very large number of files, you could use a more sophisticated data structure like a Bloom Filter.
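The answer only mentions Bloom filters in passing; the hand-rolled class below is just an illustration of the idea (the m_bits and k parameters are arbitrary, and a membership hit may be a false positive that still needs verification):

import hashlib
import struct

class BloomFilter(object):
    """Minimal Bloom filter: k bit positions per item in an m-bit array."""
    def __init__(self, m_bits=1 << 24, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # derive k positions by slicing a single SHA-256 digest of the item
        digest = hashlib.sha256(item.encode('utf-8')).digest()
        for i in range(self.k):
            value = struct.unpack('>I', digest[i * 4:(i + 1) * 4])[0]
            yield value % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
maybe_dupes = []
for filename in files:
    h = get_file_hash(filename)       # assumed to return a hex digest string here
    if h in seen:
        maybe_dupes.append(filename)  # flags later copies only; confirm with a second pass
    else:
        seen.add(h)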

OTHER TIPS

@senderle has a great answer, but since he mentioned that my solution will produce false positives, I figured the gauntlet had been thrown down and I'd better show some code. I thinned down your md5 function (it should always use the chunked-read approach from your 'fileSliceLimitation' case, and should be less stingy with its input buffer), then prefiltered by size before doing the md5s.

import sys
import os
import hashlib
from collections import defaultdict

searchdirpath = sys.argv[1]

size_map = defaultdict(list)

def getFileHashMD5(filename):
    m = hashlib.md5()
    with open(filename, 'rb', 1024*1024) as fh:
        while True:
            data = fh.read(1024*1024)
            if not data:
                break
            m.update(data)
    return m.hexdigest()

# group files by size
for dirname, dirnames, filenames in os.walk(searchdirpath):
    for filename in filenames:
        fullname = os.path.join(dirname, filename)
        size_map[os.stat(fullname).st_size].append(fullname)

# scan files of same size
for fullnames in size_map.itervalues():
    if len(fullnames) > 1:  # only sizes shared by more than one file need hashing
        hash_map = defaultdict(list)
        for fullname in fullnames:
            hash_map[getFileHashMD5(fullname)].append(fullname)
        for fullnames in hash_map.itervalues():
            if len(fullnames) > 1:
                print "duplicates:"
                for fullname in fullnames:
                    print "   ", fullname

(EDIT)

There were several questions about this implementation that I will try to answer here:

1) Why a (1024*1024) buffer size and not 5000000?

Your original code read in 8192-byte (8 KiB) increments, which is very small for modern systems. You are likely to get better performance by grabbing more at once. 1024*1024 is 1,048,576 bytes (1 MiB) and was just a guess at a reasonable number. As for why I wrote it in such a strange way, 1000 (a decimal kilobyte) is loved by people, but 1024 (a binary kibibyte) is loved by computers and file systems. I am in the habit of writing some_number*1024 so it's easy to see that I'm referring to 1 KiB increments. 5000000 is a reasonable number too, but you should consider 5*1024*1024 (that is, 5 MiB) so that you get something that is nicely aligned for the file system.

2) What does this bit do exactly: size_map = defaultdict(list)?

It creates a 'defaultdict', which adds functionality to a regular dict object. A regular dict raises a KeyError exception when it is indexed by a non-existent key. defaultdict creates a default value and adds that key/value pair to the dict instead. In our case, size_map[some_size] says "give me the list of files of some_size and create a new empty list if you don't have one".
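To see that behavior in isolation (the keys and paths below are purely illustrative):

from collections import defaultdict

plain = {}
# plain[4096]                         # would raise KeyError

size_map = defaultdict(list)
size_map[4096].append('/tmp/a.txt')   # the empty list is created automatically
size_map[4096].append('/tmp/b.txt')
print(size_map[4096])                 # ['/tmp/a.txt', '/tmp/b.txt']
print(size_map[123])                  # [] -- a new empty list, created on first access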

The line size_map[os.stat(fullname).st_size].append(fullname) breaks down to:

stat = os.stat(fullname)
size = stat.st_size
filelist = size_map[size]    # this is the same as:
                             #    if size not in size_map:
                             #        size_map[size] = list()
                             #    filelist = size_map[size]
filelist.append(fullname)

3) sys.argv[1]? I'm guessing sys.argv[1] just makes the python py.py 'filepath' argument work (where filepath is argv[1])?

Yes, when you call a Python script, sys.argv[0] is the name of the script and sys.argv[1:] (arg 1 and following) are any additional arguments given on the command line. I used sys.argv[1] as a quick way to test the script when I wrote it, and you should change that to meet your needs.
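If you would rather keep the interactive prompt from your original script but still accept a command-line argument, a small fallback like this works (just a sketch; it uses raw_input as in your Python 2 code):

import sys

if len(sys.argv) > 1:
    searchdirpath = sys.argv[1]        # path supplied on the command line
else:
    searchdirpath = raw_input("Type directory you wish to search: ")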

The first thing you're going to want to do is save the h_md5 values to a list as you loop through your files. Something like:

h_md5=[]

before you loop through your directory. And

h_md5.append(getFileHashMD5(fullname))

inside your loop. Now you have a list of hashes to compare with your output file instead of simply the last one you made in your loop.

Also, obviously, with your current code you are going to find one match for each file every time, because you will find the hash for that particular file itself in your list. So if you want to look for duplicates, you are going to have to look for instances where two distinct files produce the same hash.
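Put together, that approach might look something like this (a sketch, not the exact code from the answer; it keeps (hash, filename) pairs and reports any hash already seen for a different file):

hashes = []
for dirname, dirnames, filenames in os.walk(searchdirpath):
    for filename in filenames:
        fullname = os.path.join(dirname, filename)
        hashes.append((getFileHashMD5(fullname), fullname))

seen = {}
for h_md5, fullname in hashes:
    if h_md5 in seen:
        print('Match: ' + seen[h_md5] + ' and ' + fullname)
    else:
        seen[h_md5] = fullname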

edit: the answer above by @senderle is a much better way to do this if you are willing to change your code.
