Using an 874 MiB random data file, which required 2 seconds with the
openssl md5 tool, I was able to improve speed as follows.
- Using your first method required 21 seconds.
- Reading the entire file (21 seconds) into a buffer and then updating required 2 seconds.
- Using the following function with a buffer size of 8096 required 17 seconds.
- Using the following function with a buffer size of 32767 required 11 seconds.
- Using the following function with a buffer size of 65536 required 8 seconds.
- Using the following function with a buffer size of 131072 required 8 seconds.
- Using the following function with a buffer size of 1048576 required 12 seconds.
```python
import hashlib
import time

def md5_speedcheck(path, size):
    pts = time.process_time()
    ats = time.time()
    m = hashlib.md5()
    with open(path, 'rb') as f:
        b = f.read(size)
        while len(b) > 0:
            m.update(b)
            b = f.read(size)
    print("{0:.3f} s".format(time.process_time() - pts))  # processor time
    print("{0:.3f} s".format(time.time() - ats))          # wall-clock time
```
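As a sanity check, the chunked approach should produce the same digest no matter which buffer size is used. A small harness like the following verifies that (the `md5_chunked` helper, the temporary file, and the 1 MB test size are my own stand-ins, not part of the timing runs above):

```python
import hashlib
import os
import tempfile

def md5_chunked(path, size):
    # hash the file in size-byte chunks, same loop as md5_speedcheck
    m = hashlib.md5()
    with open(path, 'rb') as f:
        b = f.read(size)
        while len(b) > 0:
            m.update(b)
            b = f.read(size)
    return m.hexdigest()

# small temporary file of random data for the check
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1_000_000))
    path = tmp.name

# digest from hashing the whole file in one call
with open(path, 'rb') as f:
    whole = hashlib.md5(f.read()).hexdigest()

# every buffer size must yield the identical digest
for size in (8096, 32767, 65536, 131072, 1048576):
    assert md5_chunked(path, size) == whole

os.unlink(path)
```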
Wall-clock time is what I noted above; processor time is about the same for all of these, with the difference being spent blocked on IO.
The key determinant here is to have a buffer size that is big enough to mitigate disk latency, but small enough to avoid VM page swaps. For my particular machine it appears that 64 KiB is about optimal.
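As an aside, on Python 3.11 and later `hashlib.file_digest` performs the chunked read loop internally, so the manual loop is no longer necessary. A minimal sketch (the temporary file and its contents are placeholders for a real file):

```python
import hashlib
import os
import sys
import tempfile

# hashlib.file_digest exists only on Python 3.11+
if sys.version_info >= (3, 11):
    # stand-in file; in practice you would open your real data file
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example data")
        path = tmp.name

    with open(path, "rb") as f:
        # file_digest reads the file in chunks internally
        digest = hashlib.file_digest(f, "md5").hexdigest()

    assert digest == hashlib.md5(b"example data").hexdigest()
    os.unlink(path)
```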