Implemented flancor's answer here: https://github.com/sh1boot/anagram/blob/master/bitgram.c
Using http://ftp.debian.org/debian/pool/main/s/scowl/wbritish_7.1-1_all.deb as the
dictionary, I get the following proportions of perfect codes (both positive and
negative matches from the hash comparison) using these coding schemes:
order | frequency| alphabet| frequency| alphabet| frequency
code | prime | unary | unary | huffman | huffman
-------|----------|----------|----------|----------|----------
64-bit | 93.95% | 100.00% | 100.00% | 100.00% | 100.00%
32-bit | 24.60% | 69.14% | 74.23% | 86.57% | 90.58%
28-bit | 15.20% | 41.75% | 60.27% | 68.16% | 74.34%
24-bit | 8.21% | 13.14% | 38.55% | 42.67% | 50.34%
20-bit | 3.72% | 3.35% | 16.73% | 19.41% | 25.59%
16-bit | 1.31% | 0.82% | 4.86% | 5.51% | 9.18%
This shows that any variant of the bit-packed histogram encoding can capture
the whole dictionary in perfect 64-bit hashes, and more than two thirds of the
dictionary in 32-bit hashes; whereas the prime products scheme struggles to
reach 100% even in 64-bit, and can't even cover a quarter of the dictionary in
a 32-bit hash.
This list does contain a lot of apostrophes, though, and almost all (but not
all) of the 64-bit misses include apostrophes. Putting that character in its
proper place in the frequency table might help.
Some additional tests:
frequency/huffman, overflowed:
88aff7eb53ef9940: Pneumonoultramicroscopicsilicovolcanoconiosis
frequency/huffman, 64-bit perfect:
17415fbc30c2ebf7: Chargoggagoggmanchauggagoggchaubunagungamaugg
018a6b5dda873fba: Supercalifragilisticexpialidocious
00021ae1bcf50dba: Pseudopseudohypoparathyroidism
00070003dd83b5fb: Floccinaucinihilipilification
00002b800ebbdf7a: Antidisestablishmentarianism
000006c6ab75b3f9: Honorificabilitudinitatibus
0000000421b2ad94: glyptoperichthys
0000000201b2ad94: pterygoplichtys
It's probably the case that the tuning should be done against words that are likely to fall near the boundary, as they may have properties (eg., language of origin) which strongly influence their distribution, and making shorter words longer doesn't matter until they overflow.