Question

Here's 3 example md5 hashes

$ md5 -s "1" && md5 -s "2" && md5 -s "3"
MD5 ("1") = c4ca4238a0b923820dcc509a6f75849b
MD5 ("2") = c81e728d9d4c2f636f067f89cc14862c
MD5 ("3") = eccbc87e4b5ce2fe28308fd9f2a7baf3

Say I wanted to take 8 characters from any hash. Is the beginning part of the hash particularly more "random" than the end? middle? Or are all substrings equally "random"?

Was it helpful?

Solution

I was curious myself, so I went ahead and wrote a program to test this. You'll need Crypto++ to compile the code.

Disclaimer: When it comes to cryptography, or even just mathematics in general, I know just enough to shoot myself in the foot. So, take the following results with a grain of salt and keep in mind that I only have a cursory knowledge of the tools I'm using.

I only sampled three substrings: the first 8 bytes, the middle 8 bytes, and the last 8 bytes. Long story short, they're equally random.

However, when using a smaller sample space, it appears as if the last 8 bits are slightly more random. The larger the sampling space, the closer all three substrings approach complete randomness.


1000 iterations:

First:  0.995914
Middle: 0.996546
Last:   0.998104

5000 iterations:

First:  0.998387
Middle: 0.998624
Last:   0.999501

10000 iterations:

First:  0.999614
Middle: 0.999457
Last:   1

30000 iterations:

First:  1
Middle: 1
Last:   1

"Randomness" is measured by Crypto++'s MaurerRandomnessTest class. For reference, the executable compiled from the above code has a randomness value of 0.632411 and a copy of Shakespeare's Macbeth downloaded from Project Gutenburg has a randomness value of 0.566991.

OTHER TIPS

All substrings of a good hash (and md5 is reasonably good despite being cryptographically unsafe) are equally random, so yes, take any bits you like from the string, they should be equally distributed.

Nitpick: "random" is the wrong word to use here, since hash functions are deterministic.

As for answering what you mean :), a desirable property of hash functions is achieving the Avalanche effect: basically, to have every bit of input cause drastic changes to the output. So, for a well-designed hash, every substring should be affected equally often ("be as random") as any other.

Measuring the randomness of the output of a hash function can be done using Statistical tests done on pseudo-random number generators. According to the Handbook of Applied Cryptography §5.4.4 (sample chapters available for free), there are five basic tests:

  1. Frequency test (monobit test)
  2. Serial test (two-bit test)
  3. Poker test
  4. Runs test
  5. Autocorrelation test

Then, of course, there's the Maurer's universal statistical test that kurige has already mentioned.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top