Question

I am trying to validate two files downloaded from a server. The first contains data and the second file contains the MD5 hash checksum.

I created a function that returns a hexdigest from the data file like so:

import hashlib

def md5(fileName):
    """Compute md5 hash of the specified file"""
    try:
        fileHandle = open(fileName, "rb")
    except IOError:
        print("Unable to open the file in read mode: {0}".format(fileName))
        return
    m5Hash = hashlib.md5()
    # Read in 8 KiB chunks so large files are not loaded into memory at once
    while True:
        data = fileHandle.read(8192)
        if not data:
            break
        m5Hash.update(data)
    fileHandle.close()
    return m5Hash.hexdigest()

I compare the files using the following:

file = "/Volumes/Mac/dataFile.tbz"
fileHash = md5(file)

hashFile = "/Volumes/Mac/hashFile.tbz.md5"
fileHandle = open(hashFile, "rb")
fileHandleData = fileHandle.read()

if fileHash == fileHandleData:
    print ("Good")
else:
    print ("Bad")

The file comparison fails so I printed out both fileHash and fileHandleData and I get the following:

[0] b'MD5 (hashFile.tbz) = b60d684ab4a2570253961c2c2ad7b14c\n'
[0] b60d684ab4a2570253961c2c2ad7b14c

From the output above, the hash values are identical. Why does the hash comparison fail? I am new to Python and am using Python 3.2. Any suggestions?

Thanks.


Solution

The comparison fails for the same reason this is false:

a = "data"
b = b"blah (blah) - data"
print(a == b)

That .md5 file is in the BSD-style format produced by the md5 command on macOS and the BSDs, where the hash is the last token on the line. If it is always in that format, a simple way to test would be:

if fileHandleData.rstrip().endswith(fileHash.encode()):

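Because fileHash is a (Unicode) string, you have to encode it to bytes before comparing. In Python 3, encode() defaults to UTF-8; a hex digest contains only ASCII characters, so the default is safe here, but you can pass an explicit encoding such as "ascii" to make the intent clear.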

If that exact format is always expected, it would be more robust to use a regex to extract the hash value and possibly check the filename.
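A minimal sketch of the regex approach, assuming the hash file always contains a single line of the form shown in the question. The names MD5_LINE and parse_md5_line are made up for this illustration, and the pattern may need adjusting for other checksum formats:

```python
import re

# Matches one BSD-style checksum line such as
# b"MD5 (hashFile.tbz) = b60d684ab4a2570253961c2c2ad7b14c\n".
# The pattern is bytes because the hash file is read in binary mode.
MD5_LINE = re.compile(br"^MD5 \((?P<name>.+)\) = (?P<digest>[0-9a-fA-F]{32})\s*$")

def parse_md5_line(raw):
    """Return (filename, hexdigest) as str, or None if the line does not match."""
    match = MD5_LINE.match(raw)
    if match is None:
        return None
    name = match.group("name").decode()
    digest = match.group("digest").decode().lower()
    return name, digest

raw = b"MD5 (hashFile.tbz) = b60d684ab4a2570253961c2c2ad7b14c\n"
parsed = parse_md5_line(raw)
if parsed is not None:
    name, digest = parsed
    print(name, digest)
```

The returned digest is a str, so it can be compared directly with the value from hexdigest(). The returned filename can be checked against the name of the data file if that extra validation is wanted.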

Or, more flexibly, you could test substring presence:

if fileHash.encode() in fileHandleData:

OTHER TIPS

You are comparing the hash value to the entire contents of the hash file. You need to get rid of the "MD5 (hashFile.tbz) = " prefix as well as the trailing newline, so try:

if fileHash == fileHandleData.decode().rsplit(' ', 1)[-1].rstrip():
    print ("Good")
else:
    print ("Bad")

Keep in mind that the hash file was opened in binary mode, so in Python 3 fileHandleData is a bytes object while fileHash is a (Unicode) string. The bytes versions of rsplit() and rstrip() only accept bytes arguments, and a bytes object never compares equal to a str. Hence, as Fred Nurk correctly added, you need to decode fileHandleData (as above) or encode fileHash so that both sides have the same type.

The hash values are identical, but the strings are not. You need to get the hex value of the digest, and you need to parse the hash out of the file. Once you have done those you can compare them for equality.
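Putting those two steps together, one possible end-to-end sketch is below. It assumes the hash is always the last whitespace-separated token in the hash file, and the helper names (file_md5, read_expected_md5, verify) are invented for this example. The demo writes throwaway files to a temporary directory instead of using the paths from the question:

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=8192):
    """Return the MD5 hex digest (a str) of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def read_expected_md5(path):
    """Return the last token of the hash file as a lowercase str, or None if empty."""
    # Text mode yields str, so no encode/decode is needed for the comparison
    with open(path, "r") as handle:
        tokens = handle.read().split()
    return tokens[-1].lower() if tokens else None

def verify(data_path, hash_path):
    """True if the data file's digest matches the one stored in the hash file."""
    return file_md5(data_path) == read_expected_md5(hash_path)

# Demo with temporary files (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "dataFile.tbz")
    hash_path = data_path + ".md5"
    with open(data_path, "wb") as f:
        f.write(b"example payload")
    with open(hash_path, "w") as f:
        f.write("MD5 (dataFile.tbz) = {0}\n".format(file_md5(data_path)))
    print("Good" if verify(data_path, hash_path) else "Bad")
```

Opening the hash file in text mode is a design choice here: both sides of the comparison are then str, which avoids the bytes versus str mismatch from the question.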

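Try fileHandleData.strip() to remove the trailing newline, then compare the two. Note that stripping alone is not enough here: the "MD5 (hashFile.tbz) = " prefix also has to be removed, and the bytes have to be decoded before they can equal the string returned by hexdigest().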

Licensed under: CC-BY-SA with attribution