Get MD5 hash of big files in Python

https://stackoverflow.com/questions/1131220

16-09-2019
|

Question

I have used hashlib (which replaces md5 in Python 2.6/3.0) and it worked fine if I opened a file and put its content in hashlib.md5() function.

The problem is with very big files that their sizes could exceed RAM size.

How to get the MD5 hash of a file without loading the whole file to memory?

Solution

Break the file into 128-byte chunks and feed them to MD5 consecutively using update().

This takes advantage of the fact that MD5 has 128-byte digest blocks. Basically, when MD5 digest()s the file, this is exactly what it is doing.

If you make sure you free the memory on each iteration (i.e. not read the entire file to memory), this shall take no more than 128 bytes of memory.

One example is to read the chunks like so:

f = open(fileName)
while not endOfFile:
    f.read(128)

OTHER TIPS

You need to read the file in chunks of suitable size:

def md5_for_file(f, block_size=2**20):
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()

NOTE: Make sure you open your file with the 'rb' to the open - otherwise you will get the wrong result.

So to do the whole lot in one method - use something like:

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open( os.path.join(rootdir, filename) , "rb" ) as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update( buf )
    return m.hexdigest()

The update above was based on the comments provided by Frerich Raabe - and I tested this and found it to be correct on my Python 2.7.2 windows installation

I cross-checked the results using the 'jacksum' tool.

jacksum -a md5 <filename>

http://www.jonelo.de/java/jacksum/

if you care about more pythonic (no 'while True') way of reading the file check this code:

import hashlib

def checksum_md5(filename):
    md5 = hashlib.md5()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(8192), b''): 
            md5.update(chunk)
    return md5.digest()

Note that the iter() func needs an empty byte string for the returned iterator to halt at EOF, since read() returns b'' (not just '').

Here's my version of @Piotr Czapla's method:

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

Using multiple comment/answers in this thread, here is my solution :

import hashlib
def md5_for_file(path, block_size=256*128, hr=False):
    '''
    Block size directly depends on the block size of your filesystem
    to avoid performances issues
    Here I have blocks of 4096 octets (Default NTFS)
    '''
    md5 = hashlib.md5()
    with open(path,'rb') as f: 
        for chunk in iter(lambda: f.read(block_size), b''): 
             md5.update(chunk)
    if hr:
        return md5.hexdigest()
    return md5.digest()

This is "pythonic"
This is a function
It avoids implicit values: always prefer explicit ones.
It allows (very important) performances optimizations

And finally,

- This has been built by a community, thanks all for your advices/ideas.

A Python 2/3 portable solution

To calculate a checksum (md5, sha1, etc.), you must open the file in binary mode, because you'll sum bytes values:

To be py27/py3 portable, you ought to use the io packages, like this:

import hashlib
import io


def md5sum(src):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        content = fd.read()
        md5.update(content)
    return md5

If your files are big, you may prefer to read the file by chunks to avoid storing the whole file content in memory:

def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
    return md5

The trick here is to use the iter() function with a sentinel (the empty string).

The iterator created in this case will call o [the lambda function] with no arguments for each call to its next() method; if the value returned is equal to sentinel, StopIteration will be raised, otherwise the value will be returned.

If your files are really big, you may also need to display progress information. You can do that by calling a callback function which prints or logs the amount of calculated bytes:

def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
    calculated = 0
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
            calculated += len(chunk)
            callback(calculated)
    return md5

A remix of Bastien Semene code that take Hawkwing comment about generic hashing function into consideration...

def hash_for_file(path, algorithm=hashlib.algorithms[0], block_size=256*128, human_readable=True):
    """
    Block size directly depends on the block size of your filesystem
    to avoid performances issues
    Here I have blocks of 4096 octets (Default NTFS)

    Linux Ext4 block size
    sudo tune2fs -l /dev/sda5 | grep -i 'block size'
    > Block size:               4096

    Input:
        path: a path
        algorithm: an algorithm in hashlib.algorithms
                   ATM: ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
        block_size: a multiple of 128 corresponding to the block size of your filesystem
        human_readable: switch between digest() or hexdigest() output, default hexdigest()
    Output:
        hash
    """
    if algorithm not in hashlib.algorithms:
        raise NameError('The algorithm "{algorithm}" you specified is '
                        'not a member of "hashlib.algorithms"'.format(algorithm=algorithm))

    hash_algo = hashlib.new(algorithm)  # According to hashlib documentation using new()
                                        # will be slower then calling using named
                                        # constructors, ex.: hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
             hash_algo.update(chunk)
    if human_readable:
        file_hash = hash_algo.hexdigest()
    else:
        file_hash = hash_algo.digest()
    return file_hash

u can't get it's md5 without read full content. but u can use update function to read the files content block by block.
m.update(a); m.update(b) is equivalent to m.update(a+b)

I think the following code is more pythonic:

from hashlib import md5

def get_md5(fname):
    m = md5()
    with open(fname, 'rb') as fp:
        for chunk in fp:
            m.update(chunk)
    return m.hexdigest()

Implementation of accepted answer for Django:

import hashlib
from django.db import models


class MyModel(models.Model):
    file = models.FileField()  # any field based on django.core.files.File

    def get_hash(self):
        hash = hashlib.md5()
        for chunk in self.file.chunks(chunk_size=8192):
            hash.update(chunk)
        return hash.hexdigest()

I don't like loops. Based on @Nathan Feger:

md5 = hashlib.md5()
with open(filename, 'rb') as f:
    functools.reduce(lambda _, c: md5.update(c), iter(lambda: f.read(md5.block_size * 128), b''), None)
md5.hexdigest()

import hashlib,re
opened = open('/home/parrot/pass.txt','r')
opened = open.readlines()
for i in opened:
    strip1 = i.strip('\n')
    hash_object = hashlib.md5(strip1.encode())
    hash2 = hash_object.hexdigest()
    print hash2

I'm not sure that there isn't a bit too much fussing around here. I recently had problems with md5 and files stored as blobs on MySQL so I experimented with various file sizes and the straightforward Python approach, viz:

FileHash=hashlib.md5(FileData).hexdigest()

I could detect no noticeable performance difference with a range of file sizes 2Kb to 20Mb and therefore no need to 'chunk' the hashing. Anyway, if Linux has to go to disk, it will probably do it at least as well as the average programmer's ability to keep it from doing so. As it happened, the problem was nothing to do with md5. If you're using MySQL, don't forget the md5() and sha1() functions already there.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow