Question

I need a 4-character hash. At the moment I am taking the first 4 characters of a md5() hash. I am hashing a string which is 80 characters long or less. Will this lead to collision? or, what is the chance of collision, assuming I'll hash less than 65,536 (164) different elements?

Was it helpful?

Solution

Surprisingly high indeed. As you can see from this graph of an approximate collision probability (formula from the wikipedia page), with just a few hundred elements your probability of having a collision is over 50%.

Note, of course, if you're facing the possibility of an attacker providing the string, you can probably assume that it's 100% - scanning to find a collision in a 16-bit search space can be done almost instantaneously on any modern PC. Or even any modern cell phone, for that matter.

OTHER TIPS

Well, each character of md5 is a hex bit. That means it can have one of 16 possible values. So if you're only using the first 4 "hex-bits", that means you can have 16 * 16 * 16 * 16 or 16^4 or 65536 or 2^16 possibilities.

So, that means that the total available "space" for results is only 16 bits wide. Now, according to the Birthday Attack/Problem, there are the following chances for collision:

  • 50% chance -> 300 entries
  • 1% chance -> 36 entries
  • 0.0000001% chance -> 2 entries.

So there is quite a high chance for collisions.

Now, you say you need a 4 character hash. Depending on the exact requirements, you can do:

  • 4 hex-bits for 16^4 (65,536) possible values
  • 4 alpha bits for 26^4 (456,976) possible values
  • 4 alpha numeric bits for 36^4 (1,679,616) possible values
  • 4 ascii printable bits for about 93^4 (74,805,201) possible values (assuming ASCII 33 -> 126)
  • 4 full bytes for 256^4 (4,294,967,296) possible values.

Now, which you choose will depend on the actual use case. Does the hash need to be transmitted to a browser? How are you storing it, etc.

I'll give an example of each (In PHP, but should be easy to translate / see what's going on):

4 Hex-Bits:

$hash = substr(md5($data), 0, 4);

4 Alpha bits:

$hash = substr(base_convert(md5($data), 16, 26)0, 4);
$hash = str_replace(range(0, 9), range('S', 'Z'), $hash);

4 Alpha Numeric bits:

$hash = substr(base_convert(md5($data), 16, 36), 0, 4);

4 Printable Assci Bits:

$hash = hash('md5', $data, true); // We want the raw bytes
$out = '';
for ($i = 0; $i < 4; $i++) {
    $out .= chr((ord($hash[$i]) % 93) + 33);
}

4 full bytes:

$hash = substr(hash('md5', $data, true), 0, 4); // We want the raw bytes

4 first characters contains 4*4 = 16 bits of data, so collision will be definitely at 65536 elements, and, due to birthday attack, it will be found much faster. You should use more bits of hash.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top