Question

In an effort to accelerate fast-out behaviour when testing strings for anagrams, I came up with a prime-based hashing scheme -- although it looks like I wasn't the first.

The basic idea is to map letters to prime numbers, and to compute the product of those primes. Any rearrangement of the letters gives the same product, and, because factorisation into primes is unique, no other combination of letters can produce the same result so long as the product is allowed to grow arbitrarily large.

I had initially envisioned this as just a hash. Eventually the product would overflow and start to alias other letter combinations. However, by mapping the most frequent letters to the smallest primes the product grows slowly and can often avoid overflow altogether. In this case we get a perfect hash, giving both definite positive and negative results without additional testing.
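For concreteness, here's a minimal sketch of the kind of thing I mean. The frequency ordering and the overflow check are just one plausible choice, not something the scheme depends on:

    #include <ctype.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* 26 smallest primes, assigned to letters in rough English frequency
     * order so that common words keep the product small.  The exact
     * ordering is a tuning choice. */
    static const char freq_order[] = "etaoinshrdlcumwfgypbvkjxqz";

    static uint64_t letter_prime['z' - 'a' + 1];

    static void init_letter_primes(void)
    {
        static const uint64_t primes[26] = {
             2,  3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
            43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
        };
        for (int i = 0; i < 26; i++)
            letter_prime[freq_order[i] - 'a'] = primes[i];
    }

    /* Returns true if the product never overflowed, in which case equal
     * values prove an anagram; false means the value is only a hash and
     * a match still needs verification. */
    static bool prime_hash(const char *word, uint64_t *out)
    {
        uint64_t h = 1;
        bool exact = true;

        for (const char *p = word; *p; p++) {
            if (!isalpha((unsigned char)*p))
                continue;
            uint64_t f = letter_prime[tolower((unsigned char)*p) - 'a'];
            if (h > UINT64_MAX / f)
                exact = false;   /* product wraps: perfect-hash guarantee lost */
            h *= f;              /* keeps mixing modulo 2^64 after overflow */
        }
        *out = h;
        return exact;
    }

While prime_hash() returns true the value is a perfect code: equal values mean anagram, unequal values mean not. Once it returns false, equality only means "possibly an anagram".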

What's notable is that it doesn't fill the coding space very efficiently before overflowing. No result will have any prime factors greater than 103, and the distribution of small primes is fixed and not necessarily a great match to letter frequency.

Now I'm wondering if there's something substantially better than this. Something that covers more results with perfect hashes and has strong distribution in the remaining cases.

The densest coding scheme I can think of is to sort the letters and then pack them into a word with an entropy coder. In this scheme the letter distribution at each position will obviously be enormously biased by the range constraints the sort imposes (e.g., a sorted array is far less likely to start with a z than to end with one).

That sounds like a whole lot of work, though -- and I can't see it guaranteeing to give good distribution in the overflow case.

Perhaps there's a better set of factors to map the letters to, and a better way to detect when the risk of aliasing sets in. Or a hashing scheme that doesn't rely on multiplication? Something that's easy to calculate?

So that's:

  • A perfect hash for as much real-world input as possible (for some sensible number of bits).
  • A strong hash for remaining cases, with a means of distinguishing the two cases.
  • Easy to calculate.

English language constraints (26 letters with typical English-like word structure) will do fine. Multi-byte coding schemes are a whole other problem.

C code preferred because I understand it.


Solution

If you are using n-bit hashes with an alphabet of size m, you can get a unique hash for anagrams up to (n-m) characters long using the approach I described here. This makes collision detection unnecessary but it does limit your word size depending on the size of the alphabet and your available space.

To allow words of any length, I would use n-1 bits to do that hash for words up to (n-m-1) characters in length, and save the last bit to signal that the word is (n-m) characters or longer. In those cases you would use the remaining n-1 bits for your prime-number or other hashing algorithm, but of course you would have to do collision detection any time you got multiple words in those buckets. Since in a real-world application the majority of words occupy the shorter lengths, you'll drastically cut the collision detection needed for the longer words.
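Reading the (n-m) bound as a bit-packed letter histogram -- each letter's count written in unary, with one separator bit per letter -- a sketch of this flagged layout for n = 64 and m = 26 might look like the following. The histogram packing is my reading of the linked approach, and the fallback mix for long words is only a placeholder (the question's prime product would do just as well there):

    #include <ctype.h>
    #include <stdint.h>

    #define LONG_WORD_FLAG (UINT64_C(1) << 63)

    /* Words of up to 64 - 26 - 1 = 37 letters get a perfect code: each
     * letter's count is appended as a run of 1 bits below a 0 separator,
     * which leaves the top bit clear.  Longer words set the top bit and
     * fall back to an ordinary hash of the histogram, where collisions
     * must still be checked. */
    uint64_t anagram_code(const char *word)
    {
        unsigned counts[26] = { 0 };
        unsigned total = 0;

        for (const char *p = word; *p; p++) {
            if (isalpha((unsigned char)*p)) {
                counts[tolower((unsigned char)*p) - 'a']++;
                total++;
            }
        }

        if (total <= 64 - 26 - 1) {
            uint64_t h = 0;
            for (int i = 0; i < 26; i++) {
                h <<= counts[i] + 1;                  /* 0 separator ...  */
                h |= (UINT64_C(1) << counts[i]) - 1;  /* ... then the 1s  */
            }
            return h;                                 /* top bit stays 0  */
        }

        /* overflow case: flag bit plus a conventional (collidable) mix */
        uint64_t h = 14695981039346656037u;
        for (int i = 0; i < 26; i++) {
            h ^= counts[i];
            h *= 1099511628211u;
        }
        return LONG_WORD_FLAG | h;
    }

Two values with the top bit clear compare exactly; if the top bit is set, equal values only mean "probably an anagram" and the words still need a direct comparison.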

OTHER TIPS

Implemented flancor's answer here: https://github.com/sh1boot/anagram/blob/master/bitgram.c

Using http://ftp.debian.org/debian/pool/main/s/scowl/wbritish_7.1-1_all.deb as the dictionary, I get the following proportions of perfect codes (both positive and negative matches from the hash comparison) using these coding schemes:

order  | frequency|  alphabet| frequency|  alphabet| frequency
code   |    prime |    unary |    unary |  huffman |  huffman
-------|----------|----------|----------|----------|----------
64-bit |   93.95% |  100.00% |  100.00% |  100.00% |  100.00%
32-bit |   24.60% |   69.14% |   74.23% |   86.57% |   90.58%
28-bit |   15.20% |   41.75% |   60.27% |   68.16% |   74.34%
24-bit |    8.21% |   13.14% |   38.55% |   42.67% |   50.34%
20-bit |    3.72% |    3.35% |   16.73% |   19.41% |   25.59%
16-bit |    1.31% |    0.82% |    4.86% |    5.51% |    9.18%

This shows that any variant of the bit-packed histogram encoding can capture the whole dictionary in perfect 64-bit hashes, and more than two thirds of the dictionary in 32-bit hashes; whereas the prime-product scheme falls short of 100% even at 64 bits, and can't cover even a quarter of the dictionary with a 32-bit hash.

The word list does contain a lot of apostrophes, though, and almost all (but not all) of the 64-bit misses involve apostrophes. Giving that character its proper place in the frequency table might help.

Some additional tests:

frequency/huffman, overflowed:

88aff7eb53ef9940: Pneumonoultramicroscopicsilicovolcanoconiosis

frequency/huffman, 64-bit perfect:

17415fbc30c2ebf7: Chargoggagoggmanchauggagoggchaubunagungamaugg
018a6b5dda873fba: Supercalifragilisticexpialidocious
00021ae1bcf50dba: Pseudopseudohypoparathyroidism
00070003dd83b5fb: Floccinaucinihilipilification
00002b800ebbdf7a: Antidisestablishmentarianism
000006c6ab75b3f9: Honorificabilitudinitatibus
0000000421b2ad94: glyptoperichthys
0000000201b2ad94: pterygoplichtys

It's probably the case that the tuning should be done against words that are likely to fall near the boundary, as they may have properties (e.g., language of origin) which strongly influence their distribution; making shorter words' codes longer doesn't matter until they cross the overflow threshold.
