Question

In an effort to accelerate fast-out behaviour when testing strings for anagrams, I came up with a prime-based hashing scheme -- although it looks like I wasn't the first.

The basic idea is to map letters to prime numbers, and to compute the product of those primes. Any rearrangement of the letters gives the same product, and, because factorisation into primes is unique, no other combination of letters can produce the same result so long as the product is allowed to grow arbitrarily large.

I had initially envisioned this as just a hash. Eventually the product would overflow and start to alias other letter combinations. However, by mapping the most frequent letters to the smallest primes the product grows slowly and can often avoid overflow altogether. In this case we get a perfect hash, giving both definite positive and negative results without additional testing.
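For concreteness, here's a minimal sketch of the kind of thing I mean. The frequency ordering and the overflow check are just one plausible choice, not something the scheme depends on:

    #include <ctype.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* 26 smallest primes, assigned to letters in rough English frequency
     * order so that common words keep the product small.  The exact
     * ordering is a tuning choice. */
    static const char freq_order[] = "etaoinshrdlcumwfgypbvkjxqz";

    static uint64_t letter_prime['z' - 'a' + 1];

    static void init_letter_primes(void)
    {
        static const uint64_t primes[26] = {
             2,  3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
            43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
        };
        for (int i = 0; i < 26; i++)
            letter_prime[freq_order[i] - 'a'] = primes[i];
    }

    /* Returns true if the product never overflowed, in which case equal
     * values prove an anagram; false means the value is only a hash and
     * a match still needs verification. */
    static bool prime_hash(const char *word, uint64_t *out)
    {
        uint64_t h = 1;
        bool exact = true;

        for (const char *p = word; *p; p++) {
            if (!isalpha((unsigned char)*p))
                continue;
            uint64_t f = letter_prime[tolower((unsigned char)*p) - 'a'];
            if (h > UINT64_MAX / f)
                exact = false;   /* product wraps: perfect-hash guarantee lost */
            h *= f;              /* keeps mixing modulo 2^64 after overflow */
        }
        *out = h;
        return exact;
    }

While prime_hash() returns true the value is a perfect code: equal values mean anagram, unequal values mean not. Once it returns false, equality only means "possibly an anagram".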

What's notable is that it doesn't fill the coding space very efficiently before overflowing. No result will have any prime factors greater than 103, and the distribution of small primes is fixed and not necessarily a great match to letter frequency.

Now I'm wondering if there's something substantially better than this. Something that covers more results with perfect hashes and has strong distribution in the remaining cases.

The densest coding scheme I can think of is to sort the letters and then pack them into a word with an entropy coder. In this scheme the letter distribution at each position will obviously be enormously biased by the range constraints the sort imposes (e.g., a sorted array is far less likely to start with a z than to end with one).

That sounds like a whole lot of work, though -- and I can't see it guaranteeing to give good distribution in the overflow case.

Perhaps there's a better set of factors to map the letters to, and a better way to detect when the risk of aliasing sets in. Or a hashing scheme that doesn't rely on multiplication? Something that's easy to calculate?

So that's:

  • A perfect hash for as much real-world input as possible (for some sensible number of bits).
  • A strong hash for remaining cases, with a means of distinguishing the two cases.
  • Easy to calculate.

English language constraints (26 letters with typical English-like word structure) will do fine. Multi-byte coding schemes are a whole other problem.

C code preferred because I understand it.


Solution

If you are using n-bit hashes with an alphabet of size m, you can get a unique hash for anagrams up to (n-m) characters long using the approach I described here. This makes collision detection unnecessary but it does limit your word size depending on the size of the alphabet and your available space.

To allow words of any length, I would use n-1 bits to do that hash for words up to (n-m-1) characters in length, and save the last bit to signal that the word is (n-m) characters or longer. In those cases you would use the remaining n-1 bits for your prime-number or other hashing algorithm, but of course you would have to do collision detection any time you got multiple words in those buckets. Since in a real-world application the majority of words occupy the shorter lengths, you'll drastically cut the collision detection needed for the longer words.
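Reading the (n-m) bound as a bit-packed letter histogram -- each letter's count written in unary, with one separator bit per letter -- a sketch of this flagged layout for n = 64 and m = 26 might look like the following. The histogram packing is my reading of the linked approach, and the fallback mix for long words is only a placeholder (the question's prime product would do just as well there):

    #include <ctype.h>
    #include <stdint.h>

    #define LONG_WORD_FLAG (UINT64_C(1) << 63)

    /* Words of up to 64 - 26 - 1 = 37 letters get a perfect code: each
     * letter's count is appended as a run of 1 bits below a 0 separator,
     * which leaves the top bit clear.  Longer words set the top bit and
     * fall back to an ordinary hash of the histogram, where collisions
     * must still be checked. */
    uint64_t anagram_code(const char *word)
    {
        unsigned counts[26] = { 0 };
        unsigned total = 0;

        for (const char *p = word; *p; p++) {
            if (isalpha((unsigned char)*p)) {
                counts[tolower((unsigned char)*p) - 'a']++;
                total++;
            }
        }

        if (total <= 64 - 26 - 1) {
            uint64_t h = 0;
            for (int i = 0; i < 26; i++) {
                h <<= counts[i] + 1;                  /* 0 separator ...  */
                h |= (UINT64_C(1) << counts[i]) - 1;  /* ... then the 1s  */
            }
            return h;                                 /* top bit stays 0  */
        }

        /* overflow case: flag bit plus a conventional (collidable) mix */
        uint64_t h = 14695981039346656037u;
        for (int i = 0; i < 26; i++) {
            h ^= counts[i];
            h *= 1099511628211u;
        }
        return LONG_WORD_FLAG | h;
    }

Two values with the top bit clear compare exactly; if the top bit is set, equal values only mean "probably an anagram" and the words still need a direct comparison.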

OTHER TIPS

Implemented flancor's answer here: https://github.com/sh1boot/anagram/blob/master/bitgram.c

Using http://ftp.debian.org/debian/pool/main/s/scowl/wbritish_7.1-1_all.deb as the dictionary, I get the following proportions of perfect codes (both positive and negative matches from the hash comparison) using these coding schemes:

order  | frequency|  alphabet| frequency|  alphabet| frequency
code   |    prime |    unary |    unary |  huffman |  huffman
-------|----------|----------|----------|----------|----------
64-bit |   93.95% |  100.00% |  100.00% |  100.00% |  100.00%
32-bit |   24.60% |   69.14% |   74.23% |   86.57% |   90.58%
28-bit |   15.20% |   41.75% |   60.27% |   68.16% |   74.34%
24-bit |    8.21% |   13.14% |   38.55% |   42.67% |   50.34%
20-bit |    3.72% |    3.35% |   16.73% |   19.41% |   25.59%
16-bit |    1.31% |    0.82% |    4.86% |    5.51% |    9.18%

This shows that any variant of the bit-packed histogram encoding can capture the whole dictionary in perfect 64-bit hashes, and more than two thirds of the dictionary in 32-bit hashes; whereas the prime-product scheme falls short of 100% even at 64 bits, and can't cover even a quarter of the dictionary with a 32-bit hash.

The word list does contain a lot of apostrophes, though, and almost all (but not all) of the 64-bit misses involve apostrophes. Giving that character its proper place in the frequency table might help.

Some additional tests:

frequency/huffman, overflowed:

88aff7eb53ef9940: Pneumonoultramicroscopicsilicovolcanoconiosis

frequency/huffman, 64-bit perfect:

17415fbc30c2ebf7: Chargoggagoggmanchauggagoggchaubunagungamaugg
018a6b5dda873fba: Supercalifragilisticexpialidocious
00021ae1bcf50dba: Pseudopseudohypoparathyroidism
00070003dd83b5fb: Floccinaucinihilipilification
00002b800ebbdf7a: Antidisestablishmentarianism
000006c6ab75b3f9: Honorificabilitudinitatibus
0000000421b2ad94: glyptoperichthys
0000000201b2ad94: pterygoplichtys

It's probably the case that the tuning should be done against words that are likely to fall near the boundary, as they may have properties (e.g., language of origin) which strongly influence their distribution; making shorter words' codes longer doesn't matter until they cross the overflow threshold.
