The following simple script runs an infinite loop that fetches 2 random 20-byte sequences, calculates CRC16 and checks if there is a collision. Continuous evaluation of this loop de facto estimates collision percentage:
#!/usr/bin/env perl
use Digest::CRC qw(crc16);
open(my $f, '<', '/dev/urandom');
my $n = 0;
my $coll = 0;
while (1) {
read $f, $randstr1, 20;
read $f, $randstr2, 20;
my $crc1 = crc16($randstr1);
my $crc2 = crc16($randstr2);
$n++;
$coll++ if $crc1 == $crc2;
printf "percent of collisions = %.6f%%\n", $coll * 100.0 / $n if ($n % 100000 == 0);
}
From what I get on my computer, collision percentage seems to be around 0.0016%
(or 1e-5
, or "1 in 100_000"), which is way worse than predicted estimates based on ideal hashing distribution of a 16-bit hash (such as 2^16 / 2^160).
UPDATE: I see you've clarified that 20 bytes are not just fully random bytes, but fall into range of [a-z0-9]
. Here's the updated version that estimates collisions in that alphabet:
#!/usr/bin/env perl
use Digest::CRC qw(crc16);
my $n = 0;
my $coll = 0;
my @chars = ('a'..'z', '0'..'9');
sub randstr() {
my $res;
foreach (1..20) { $res .= $chars[rand @chars]; }
return $res;
}
while (1) {
my $crc1 = crc16(randstr());
my $crc2 = crc16(randstr());
$n++;
$coll++ if $crc1 == $crc2;
printf "percent of collisions = %.4f%%\n", $coll * 100.0 / $n if ($n % 100000 == 0);
}
It yields approximately the same result, about 0.0016%
.