Question

I need the MD5 sums of 3 million strings or so in a bash script on Ubuntu. 3 million strings -> 3 million MD5 hashes. The trivial implementation takes about 0.005s per string, which is over 4 hours in total. What faster alternatives exist? Is there a way to pump groups of strings into md5sum?

#time md5sum running 100 times on short strings
#each iteration is ~0.494s/100 = 0.005s
time (for i in {0..99}; do md5sum <(echo $i); done) > /dev/null

real    0m0.494s
user    0m0.120s
sys     0m0.356s

A good solution will include a bash/Perl script that takes a list of strings from stdin and outputs a list of their MD5 hashes.


Solution

It's not hard to do in C (or Perl, or Python) using any of the many MD5 implementations -- at its heart, MD5 is a hash function that maps a byte string to a 16-byte digest.

So just write an outer program that reads your 3 million strings and feeds them one by one to the MD5 implementation of your choice. That way you have one program startup rather than 3 million, and that alone will save you time.

FWIW, in one project I used the MD5 implementation (in C) by Christophe Devine; OpenSSL has one as well, and I am sure CPAN has a number of them for Perl too.

Edit: OK, couldn't resist. The MD5 implementation I mentioned is, e.g., inside this small tarball. Take the file md5.c and replace the (#ifdef'ed-out) main() at the bottom with this:

int main( int argc, char *argv[] ) {
    FILE *f;
    int j;
    md5_context ctx;
    unsigned char buf[1000];
    unsigned char md5sum[16];

    if( argc < 2 ) {
        fprintf( stderr, "usage: %s file\n", argv[0] );
        return( 1 );
    }
    if( ! ( f = fopen( argv[1], "rb" ) ) ) {
        perror( "fopen" );
        return( 1 );
    }

    /* read one whitespace-delimited string at a time; the width cap
       keeps fscanf from overflowing buf */
    while( fscanf( f, "%999s", (char *) buf ) == 1 ) {
        md5_starts( &ctx );                 /* initialize the context   */
        md5_update( &ctx, buf, (uint32) strlen( (char *) buf ) );
        md5_finish( &ctx, md5sum );         /* finalize the digest      */

        /* print the 16-byte digest as 32 hex characters */
        for( j = 0; j < 16; j++ ) {
            printf( "%02x", md5sum[j] );
        }
        printf( " <- %s\n", buf );
    }
    fclose( f );
    return( 0 );
}

Build it as a simple standalone program, e.g.:

/tmp$ gcc -Wall -O3 -o simple_md5 simple_md5.c

and then you get this:

# first, generate 300,000 numbers in a file (using 'little r', an R variant)
/tmp$ r -e'for (i in 1:300000) cat(i,"\n")' > foo.txt

# illustrate the output
/tmp$ ./simple_md5 foo.txt | head
c4ca4238a0b923820dcc509a6f75849b <- 1
c81e728d9d4c2f636f067f89cc14862c <- 2
eccbc87e4b5ce2fe28308fd9f2a7baf3 <- 3
a87ff679a2f3e71d9181a67b7542122c <- 4
e4da3b7fbbce2345d7772b0674a318d5 <- 5
1679091c5a880faf6fb5e6087eb1b2dc <- 6
8f14e45fceea167a5a36dedd4bea2543 <- 7
c9f0f895fb98ab9159f51fd0297e236d <- 8
45c48cce2e2d7fbdea1afc51c7c6ad26 <- 9
d3d9446802a44259755d38e6d163e820 <- 10

# let the program rip over it, suppressing stdout
/tmp$ time (./simple_md5 foo.txt > /dev/null)

real    0m1.023s
user    0m1.008s
sys     0m0.012s
/tmp$

So that's about a second for 300,000 (short) strings -- scaled linearly, roughly ten seconds for all 3 million, versus the four-plus hours of the original loop.

OTHER TIPS

perl -MDigest::MD5=md5_hex -lpe '$_ = md5_hex $_'
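This one-liner does exactly what the question asks: -p loops over stdin a line at a time, -l strips the trailing newline before hashing and adds one back on output, and md5_hex from Digest::MD5 does the hashing, all in a single Perl process. For example, seq 1 100 | perl -MDigest::MD5=md5_hex -lpe '$_ = md5_hex $_' prints one hash per line. Note that because -l chomps the newline, this hashes the bare string, whereas md5sum <(echo $i) hashes the string plus a trailing newline, so the digests differ.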
For comparison, the original md5sum loop versus a small Python script over the same 100 strings:

#~/sw/md5$ time (for i in {0..99}; do md5sum <(echo $i); done) > /dev/null

real    0m0.220s
user    0m0.084s
sys     0m0.160s
#~/sw/md5$ time (python test.py `for i in {0..99}; do echo $i; done`) > /dev/null

real    0m0.041s
user    0m0.024s
sys 0m0.012s

The Python code is five times as fast for this small sample; for larger samples the difference is much bigger because of the missing process spawns -- 1,000 samples take 0.033s (Python) versus 2.3s (the md5sum loop) :) The script is:

#!/usr/bin/env python3
import hashlib, sys

# hash each command-line argument, printing one hex digest per line
for arg in sys.argv[1:]:
    print(hashlib.md5(arg.encode()).hexdigest())
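Note that the kernel's argument-length limit will not let you pass anything near 3 million strings on the command line, and the question asks for a script that reads stdin anyway. A minimal sketch of a stdin-reading variant (the file name and details are mine, assuming one string per line):

#!/usr/bin/env python3
# Read strings from stdin, one per line, and print the MD5 hex digest
# of each string (trailing newline stripped), all in a single process.
import hashlib, sys

for line in sys.stdin:
    print(hashlib.md5(line.rstrip("\n").encode()).hexdigest())

Saved as, say, md5_lines.py, it would run as python3 md5_lines.py < strings.txt > hashes.txt.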

I'm without a machine to test it on right now, but is md5sum <<< "$i" any faster than md5sum <(echo $i)? The <<< syntax would avoid the overhead of forking a subshell for the echo and would pass $i directly to md5sum on standard input. (Both forms feed md5sum the string plus a trailing newline, so the hashes come out the same; either way, md5sum itself is still started once per string.)

To get better performance you would probably need to use a different program, or write a C program that calls one of the publicly available MD5 hashing APIs (as the solution above does).

Another option is to spawn multiple md5sum calls at once to take advantage of multiple cores. On each pass through the loop you could spawn eight calls, the first seven with & at the end (to run them asynchronously). With 4-8 cores available this could cut the wall-clock time by roughly that factor, although the per-process startup cost remains; the sketch below shows the same idea in a single process.
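A hedged sketch of that multi-core idea without spawning md5sum at all, using a Python process pool over stdin lines (the script and its parameter choices are illustrative, not from any answer above):

#!/usr/bin/env python3
# Split stdin lines across a pool of worker processes; large chunks keep
# the inter-process dispatch overhead small relative to the hashing.
import hashlib, sys
from multiprocessing import Pool

def md5_line(s):
    return hashlib.md5(s.encode()).hexdigest()

if __name__ == "__main__":
    lines = [line.rstrip("\n") for line in sys.stdin]
    with Pool() as pool:  # one worker per core by default
        for digest in pool.map(md5_line, lines, chunksize=10000):
            print(digest)

MD5 on short strings is so cheap that the pickling traffic between processes can eat much of the gain, so it is worth timing this against the single-process version first.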

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow