Question

I'm currently using the Linux md5sum command in a bash script on a very lightweight (low-processor/low-memory) Linux device to compute and record the checksums of thousands of similarly named 32MB files in a single directory.

md5sum ./file* >fingerprint.txt

The next day, I repeat the process on the same set of files and programmatically compare the results against the prior day's hashes. When I find that the fingerprint of a file has changed between day 1 and day 2, I take action on that specific file. If the file is unchanged, I take no action and continue my comparison.
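For reference, a minimal sketch of that day-to-day comparison, assuming two fingerprint files named fingerprint.day1.txt and fingerprint.day2.txt (hypothetical names) in md5sum's standard "hash  filename" format:

    # Join the two fingerprint lists on the filename (field 2) and
    # print the names of files whose hash changed between runs.
    join -j 2 <(sort -k 2 fingerprint.day1.txt) <(sort -k 2 fingerprint.day2.txt) \
      | awk '$2 != $3 { print $1 }'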

The problem I'm running into is that the MD5 method takes a LONG time on each file. The comparison needs to complete within a certain time frame, and I'm starting to hit incidents where the entire process simply takes too long.

Is there some other method/tool I could use to reliably perform this kind of comparison? (Note: a date comparison on the files is not adequate, and the file sizes remain a constant 32MB, so a size check won't help either.)

Solution

MD5 is supposed to be fast, as cryptographic hash functions go. But any given implementation may make choices that, on a specific machine, result in suboptimal performance. What kind of hardware do you use? Processor type and L1 cache size matter quite a bit.

You may want to have a look at sphlib: a library implementing many cryptographic hash functions in C (optimized, but portable) and Java. The C code can be compiled with an additional "small footprint" flag, which helps on small embedded platforms (mainly because of L1 cache size issues). The code also comes with an md5sum-like command-line utility and a speed benchmark tool.
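As an illustration, enabling the small-footprint build is typically a matter of defining a macro at compile time. A hedged sketch, assuming the macro is named SPH_SMALL_FOOTPRINT (check sphlib's own documentation for the exact flag and source file names):

    # Macro name and source file name are assumptions; see sphlib's docs.
    gcc -O2 -DSPH_SMALL_FOOTPRINT=1 -c md5.c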

Among the hash functions, MD4 is usually the fastest, but on some platforms Panama, Radiogatun[32] and Radiogatun[64] can achieve similar or better performance. You may also want to have a look at some of the SHA-3 candidates, in particular Shabal, which is quite fast on small 32-bit systems.
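If you want to measure whether MD4 actually helps on your hardware before adopting a new tool, OpenSSL's dgst command can compute it on many builds (MD4 is a legacy algorithm and may be disabled in some OpenSSL versions; note the output format also differs from md5sum's):

    # MD4 via OpenSSL; works only if your OpenSSL build enables legacy MD4.
    openssl dgst -md4 ./file* > fingerprint.md4.txt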

Important note: some hash functions are "broken", in that it is possible to create collisions: two distinct input files which hash to the same value (exactly what you want to avoid). MD4 and MD5 are "broken" in this sense. However, a collision must be constructed on purpose; you will not hit one by sheer (bad) luck (the probability is smaller than that of a "collision" caused by a hardware error during the computation). If you are in a security-related situation (someone may want to actively provoke a collision), then things are more difficult. Among the functions cited above, Radiogatun and Shabal are currently unbroken.

OTHER TIPS

Ways to speed it up:

  • If you have multiple cores, you could run more than one md5sum process at a time (see the xargs sketch after this list). But I suspect that your problem is disk access, in which case this may not help.
  • Do you really need an MD5 hash? Check the modification date/time, size and inode instead, as a quick check (see the stat sketch after this list).
  • Consider performing the quick check daily and the slow MD5 check weekly.
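To illustrate the first tip, a minimal sketch of parallel hashing with xargs (the -n and -P values are assumptions; match -P to your core count, and sort the output by filename since completion order is not deterministic):

    # Hash 16 files per md5sum invocation, 4 invocations in parallel.
    printf '%s\0' ./file* \
      | xargs -0 -n 16 -P 4 md5sum \
      | sort -k 2 > fingerprint.txt

And for the quick metadata check, GNU stat can record size, mtime and inode in one pass (quickprint.txt is a hypothetical name):

    # %s = size in bytes, %Y = mtime (epoch seconds), %i = inode, %n = name.
    stat --format='%s %Y %i %n' ./file* > quickprint.txt
    # The next day, diff today's quickprint against yesterday's.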

I suspect you don't really need an MD5 hash of every file every time, and you might be better off carefully considering your actual requirements and the minimal solution that meets them.

Licensed under: CC-BY-SA with attribution