Question

I am developing a system which needs to store a hash for a structure 20 bytes maybe less in length. However, in order to optimize the process of looking up the hash in a series of hashes, we want to reduce the size of the hash as much as possible.

So my question is this, is there a relationship between the amount of data we feed into a crc16 hash and the probability of its collision with another entry of the same length? If so, which is the most optimal length for this?

18 of the bytes fall within the ascii table (a-z, 0-9), and the remaining range between 0 and 10

Was it helpful?

Solution

The following simple script runs an infinite loop that fetches 2 random 20-byte sequences, calculates CRC16 and checks if there is a collision. Continuous evaluation of this loop de facto estimates collision percentage:

#!/usr/bin/env perl

use Digest::CRC qw(crc16);

open(my $f, '<', '/dev/urandom');
my $n = 0;
my $coll = 0;

while (1) {
    read $f, $randstr1, 20;
    read $f, $randstr2, 20;
    my $crc1 = crc16($randstr1);
    my $crc2 = crc16($randstr2);

    $n++;
    $coll++ if $crc1 == $crc2;

    printf "percent of collisions = %.6f%%\n", $coll * 100.0 / $n if ($n % 100000 == 0);
}

From what I get on my computer, collision percentage seems to be around 0.0016% (or 1e-5, or "1 in 100_000"), which is way worse than predicted estimates based on ideal hashing distribution of a 16-bit hash (such as 2^16 / 2^160).

UPDATE: I see you've clarified that 20 bytes are not just fully random bytes, but fall into range of [a-z0-9]. Here's the updated version that estimates collisions in that alphabet:

#!/usr/bin/env perl

use Digest::CRC qw(crc16);

my $n = 0;
my $coll = 0;
my @chars = ('a'..'z', '0'..'9');

sub randstr() {
    my $res;
    foreach (1..20) { $res .= $chars[rand @chars]; }
    return $res;
}

while (1) {
    my $crc1 = crc16(randstr());
    my $crc2 = crc16(randstr());

    $n++;
    $coll++ if $crc1 == $crc2;

    printf "percent of collisions = %.4f%%\n", $coll * 100.0 / $n if ($n % 100000 == 0);
}

It yields approximately the same result, about 0.0016%.

OTHER TIPS

A good 16-bit hash should have a probability of 2^-16 of collision given two distinct inputs. CRC16 isn't a very good hash, but unless you have an adversary picking the inputs, it should be good enough for your purposes.

Keep in mind the birthday paradox. You'll start to get collisions after you hash about 2^8 items.

Whether or not you get a likely hash collision depends on the contents of the data, not its quantity. If it's not deliberately chosen to collide, then you should be pretty safe in a situation like this where the size of the data is 10x the size of hash. That said, it's still a 16 bit hash and the likelihood of collisions is quite high by modern standards.

The probability of a hash collision does not depend on the length of the message, so long as the entropy (number of significant bits) of the message is greater than or equal to the number of bits in the hash, and that it is a good hash that well mixes the bits of the input into each hash.

In your case you have about 100 bits of entropy, so as long as you have a good hash of length 100 bits or less, then the collision probability will depend only on the number of bits in the hash and the number of opportunities you have for collisions. This answer shows how to calculate the probability of a collision.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top