Is any substring of a hash (md5, sha1) more “random” than another?
Question
Here's 3 example md5 hashes
$ md5 -s "1" && md5 -s "2" && md5 -s "3"
MD5 ("1") = c4ca4238a0b923820dcc509a6f75849b
MD5 ("2") = c81e728d9d4c2f636f067f89cc14862c
MD5 ("3") = eccbc87e4b5ce2fe28308fd9f2a7baf3
Say I wanted to take 8 characters from any hash. Is the beginning part of the hash particularly more "random" than the end? middle? Or are all substrings equally "random"?
Solution
I was curious myself, so I went ahead and wrote a program to test this. You'll need Crypto++ to compile the code.
Disclaimer: When it comes to cryptography, or even just mathematics in general, I know just enough to shoot myself in the foot. So, take the following results with a grain of salt and keep in mind that I only have a cursory knowledge of the tools I'm using.
I only sampled three substrings: the first 8 bytes, the middle 8 bytes, and the last 8 bytes. Long story short, they're equally random.
However, when using a smaller sample space, it appears as if the last 8 bits are slightly more random. The larger the sampling space, the closer all three substrings approach complete randomness.
1000 iterations:
First: 0.995914
Middle: 0.996546
Last: 0.998104
5000 iterations:
First: 0.998387
Middle: 0.998624
Last: 0.999501
10000 iterations:
First: 0.999614
Middle: 0.999457
Last: 1
30000 iterations:
First: 1
Middle: 1
Last: 1
"Randomness" is measured by Crypto++'s MaurerRandomnessTest class. For reference, the executable compiled from the above code has a randomness value of 0.632411
and a copy of Shakespeare's Macbeth downloaded from Project Gutenburg has a randomness value of 0.566991
.
OTHER TIPS
All substrings of a good hash (and md5 is reasonably good despite being cryptographically unsafe) are equally random, so yes, take any bits you like from the string, they should be equally distributed.
Nitpick: "random" is the wrong word to use here, since hash functions are deterministic.
As for answering what you mean :), a desirable property of hash functions is achieving the Avalanche effect: basically, to have every bit of input cause drastic changes to the output. So, for a well-designed hash, every substring should be affected equally often ("be as random") as any other.
Measuring the randomness of the output of a hash function can be done using Statistical tests done on pseudo-random number generators. According to the Handbook of Applied Cryptography §5.4.4 (sample chapters available for free), there are five basic tests:
- Frequency test (monobit test)
- Serial test (two-bit test)
- Poker test
- Runs test
- Autocorrelation test
Then, of course, there's the Maurer's universal statistical test that kurige has already mentioned.