Maximum number of different numbers, Huffman Compression

https://stackoverflow.com/questions/8885703

huffman-code

29-10-2019

Question

I want to compress many 32-bit numbers using Huffman compression.

Each number may appear multiple times, and I know that every number will be replaced with a bit sequence:

111 010 110 1010 1000 etc...

Now, the question: How many different numbers can be added to the Huffman tree before the length of the binary sequence exceeds 32 bits?

The rule for generating sequences (for those who don't know) is that every time a new number is added, you must assign it the shortest possible binary sequence that is not a prefix of another.

Solution

You seem to understand the principle of prefix codes.

Many people (confusingly) refer to all prefix codes as "Huffman codes".

There are many other kinds of prefix codes -- none of them compress data into any fewer bits than Huffman compression (if we neglect the overhead of transmitting the frequency table), but many of them get pretty close (with some kinds of data) and have other advantages, such as running much faster or guaranteeing some maximum code length ("length-limited prefix codes").

If you have large numbers of unique symbols, the overhead of the Huffman frequency table becomes large -- perhaps some other prefix code can give better net compression.

Many people doing compression and decompression in hardware have fixed limits for the maximum codeword size -- many image and video compression algorithms specify a "length-limited Huffman code".

The fastest prefix codes -- universal codes -- do, in fact, involve a series of bit sequences that can be pre-generated without regard to the actual symbol frequencies. Compression programs that use these codes, as you mentioned, associate the most-frequent input symbol with the shortest bit sequence, the next-most-frequent input symbol with the next-shortest bit sequence, and so on.

For example, some compression programs use Fibonacci codes (a kind of universal code), and always associate the most-frequent symbol to the bit sequence "11", the next-most-frequent symbol to the bit sequence "011", the next to "0011", the next to "1011", and so on.
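To make that list of codewords concrete, here is a minimal sketch of a Fibonacci encoder. The function name `fibonacci_encode` is my own, and the code assumes the usual convention that symbols are ranked 1, 2, 3, ... from most to least frequent; it is an illustration, not the code of any particular compression program.

```python
def fibonacci_encode(n: int) -> str:
    """Return the Fibonacci codeword for a positive integer n as a bit string."""
    if n < 1:
        raise ValueError("Fibonacci coding is defined for n >= 1")
    # Fibonacci numbers 1, 2, 3, 5, 8, ... that do not exceed n.
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs = [f for f in fibs if f <= n]
    # Greedy Zeckendorf decomposition, largest Fibonacci number first.
    bits = ["0"] * len(fibs)
    remainder = n
    for i in range(len(fibs) - 1, -1, -1):
        if fibs[i] <= remainder:
            bits[i] = "1"
            remainder -= fibs[i]
    # The appended "1" creates the "11" pair that terminates every codeword.
    return "".join(bits) + "1"


print([fibonacci_encode(rank) for rank in range(1, 5)])
# ['11', '011', '0011', '1011']
```

Because no Zeckendorf representation contains two adjacent 1 bits, the pair "11" can only occur at the end of a codeword, which is what makes the code a prefix code.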

The Huffman algorithm produces a code that is similar in many ways to a universal code -- both are prefix codes. But, as Cyan points out, the Huffman algorithm is slightly different from those universal codes. If you have 5 different symbols, the Huffman tree will contain 5 different bit sequences -- however, the exact bit sequences generated by the Huffman algorithm depend on the exact frequencies. One document may have symbol counts of { 10, 10, 20, 40, 80 }, leading to Huffman bit sequences such as { 0000 0001 001 01 1 }. Another document may have symbol counts of { 40, 40, 79, 79, 80 }, leading to Huffman bit sequences such as { 000 001 01 10 11 }. Even though both situations have exactly 5 unique symbols, the actual Huffman code for the most-frequent symbol is very different in these two compressed documents -- the Huffman code "1" in one document, the Huffman code "11" in another document. If, however, you compressed those documents with the Fibonacci code, the Fibonacci code for the most-frequent symbol is always the same -- "11" in every document.
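The two example documents can be checked with a small sketch that computes only the Huffman codeword lengths (the exact 0/1 labels are arbitrary, so lengths are what matter). The function name `huffman_code_lengths` is hypothetical, and this is one straightforward heap-based formulation rather than a tuned implementation.

```python
import heapq
from itertools import count


def huffman_code_lengths(counts):
    """Return the Huffman codeword length for each symbol count, in input order."""
    if len(counts) < 2:
        raise ValueError("need at least two symbols")
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    # Each heap entry: (weight, tiebreak, indices of the symbols in this subtree).
    heap = [(w, next(tiebreak), [i]) for i, w in enumerate(counts)]
    heapq.heapify(heap)
    lengths = [0] * len(counts)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), s1 + s2))
    return lengths


print(huffman_code_lengths([10, 10, 20, 40, 80]))  # [4, 4, 3, 2, 1]
print(huffman_code_lengths([40, 40, 79, 79, 80]))  # [3, 3, 2, 2, 2]
```

The same five symbols get a 1-bit code for the most frequent symbol in the first document and a 2-bit code in the second, which is the frequency dependence described above.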

For Fibonacci in particular, the first 33-bit Fibonacci code is "31 zero bits followed by 2 one bits", representing the value F(33) = 3,524,578. And so 3,524,577 unique symbols can be represented by Fibonacci codes of 32 bits or fewer.

One of the more counter-intuitive features of prefix codes is that some symbols (the rare symbols) are "compressed" into much longer bit sequences. If you actually have 2^32 unique symbols (all possible 32-bit numbers), it is not possible to gain any compression if you force the compressor to use prefix codes limited to 32 bits or fewer. Likewise, if you actually have 2^8 unique symbols (all possible 8-bit numbers), it is not possible to gain any compression if you force the compressor to use prefix codes limited to 8 bits or fewer. Allowing the compressor to expand rare values -- to use more than 8 bits (or more than 32 bits) to store a rare symbol that we know can be stored in 8 bits (or 32 bits) -- frees it up to use fewer than 8 bits (or fewer than 32 bits) to store the more-frequent symbols.
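The reason behind this trade-off is the Kraft inequality: a prefix code with the given codeword lengths exists only if the sum of 2^(-length) over all codewords is at most 1. The sketch below (the helper name `kraft_sum` is my own) shows that with 2^8 symbols capped at 8 bits there is no room to shorten anything, while lengthening two rare symbols to 9 bits pays for one 7-bit codeword.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Sum of 2**-length over all codewords; a prefix code needs this to be <= 1."""
    return sum(Fraction(1, 2 ** length) for length in lengths)


all_eight = [8] * 256                 # every 8-bit value gets an 8-bit code
one_short = [7] + [8] * 255           # shorten one code, keep the 8-bit cap
traded = [7] + [8] * 253 + [9] * 2    # shorten one code, lengthen two rare ones

print(kraft_sum(all_eight))   # 1        -> the budget is fully used
print(kraft_sum(one_short))   # 257/256  -> over budget, no such prefix code
print(kraft_sum(traded))      # 1        -> feasible again
```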

In particular, if I use Fibonacci codes to compress a table of values that includes all possible 32-bit numbers, I must use Fibonacci codes up to N bits long, where N is the largest index with F(N) <= 2^32 -- solving for N, I get N = 47 bits for the least-frequently-used 32-bit symbol.
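Both Fibonacci figures quoted above (3,524,577 symbols within 32 bits, and 47 bits for the 2^32-th symbol) can be reproduced with a short sketch. It assumes the indexing F(1) = F(2) = 1 and relies on the fact that the codeword for rank n has N bits, where N is the largest index with F(N) <= n; the function name `fibonacci_code_length` is hypothetical.

```python
def fibonacci_code_length(n: int) -> int:
    """Length in bits of the Fibonacci codeword for a positive integer n."""
    if n < 1:
        raise ValueError("Fibonacci coding is defined for n >= 1")
    # Walk the sequence until the next Fibonacci number would exceed n.
    index, current, following = 2, 1, 2  # current = F(2), following = F(3)
    while following <= n:
        current, following = following, current + following
        index += 1
    return index


print(fibonacci_code_length(3_524_577))  # 32
print(fibonacci_code_length(3_524_578))  # 33
print(fibonacci_code_length(2 ** 32))    # 47
```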

OTHER TIPS

Huffman is about compression, and compression requires a "skewed" distribution to work (assuming we are talking about normal, order-0, entropy).

The worst situation regarding Huffman tree depth is when the algorithm creates a degenerate tree, i.e. one with only one leaf per level. This situation can happen if the distribution looks like a Fibonacci series.

Therefore, the worst distribution sequence looks like this: 1, 1, 1, 2, 3, 5, 8, 13, ...

In this case, you fill the full 32-bit tree with only 33 different elements.

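Note, however, that to reach a depth of 32 bits with only 33 elements, the most numerous element must appear 2 178 309 times, and the other 32 elements together must appear 3 524 578 times.

Therefore, since these counts sum to 5 702 887, you need to compress at least 5 702 887 numbers before a 32-bit codeword can appear at all. A codeword longer than 32 bits needs one more element in the sequence (34 elements, the most numerous appearing 3 524 578 times), so you need at least 9 227 465 numbers to start having a risk of not being able to represent them with a 32-bit Huffman tree.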
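The worst case can be reproduced with a small sketch. It tracks only subtree weights and heights, and it assumes the unlucky tie-breaking rule (prefer the taller subtree when weights are equal), since the Fibonacci-like counts contain ties and a different rule may give a shallower tree. Both function names are my own.

```python
import heapq


def worst_case_counts(num_symbols):
    """Counts 1, 1, 1, 2, 3, 5, ...: one extra 1 followed by Fibonacci numbers."""
    counts = [1, 1, 1]
    while len(counts) < num_symbols:
        counts.append(counts[-1] + counts[-2])
    return counts[:num_symbols]


def max_huffman_depth(counts):
    """Longest Huffman codeword length, breaking weight ties toward taller subtrees."""
    # Heap entries are (weight, -height); the negated height makes taller
    # subtrees win ties, which is the choice that yields the degenerate tree.
    heap = [(w, 0) for w in counts]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, h1 = heapq.heappop(heap)
        w2, h2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, min(h1, h2) - 1))
    return -heap[0][1]


for n in (33, 34):
    counts = worst_case_counts(n)
    print(n, max_huffman_depth(counts), max(counts), sum(counts))
# 33 32 2178309 5702887
# 34 33 3524578 9227465
```

With 33 symbols the deepest codeword is exactly 32 bits; the 34-symbol case is the smallest input for which a 33-bit codeword can appear.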

That being said, using a Huffman tree to represent 32-bit numbers requires a considerable amount of memory to calculate and maintain the tree.

[Edit] A simpler format, called "logarithm approximation", gives almost the same weight to all symbols. In this case, only the total number of symbols is required.

It computes very fast: say for 300 symbols, some will use 8 bits and others will use 9 bits. The formula to decide how many of each type:

9 bits: (300 - 256) * 2 = 44 * 2 = 88; 8 bits: 300 - 88 = 212

Then you can distribute the numbers as you wish (preferably the most frequent ones using 8 bits, but that's not important).
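The arithmetic above generalizes to any symbol count. The sketch below is my own reading of the scheme, with a hypothetical function name: the short codes use floor(log2(n)) bits, the long codes use one bit more, and the counts are chosen so that the two groups exactly fill the code space.

```python
def near_uniform_code_split(num_symbols: int):
    """Return (short_bits, short_count, long_count) for the two-length scheme."""
    if num_symbols < 1:
        raise ValueError("need at least one symbol")
    short_bits = num_symbols.bit_length() - 1  # floor(log2(num_symbols))
    # Every symbol beyond 2**short_bits splits one short code into two long ones.
    long_count = (num_symbols - 2 ** short_bits) * 2
    short_count = num_symbols - long_count
    return short_bits, short_count, long_count


print(near_uniform_code_split(300))  # (8, 212, 88)
```

For 300 symbols this gives 212 codes of 8 bits and 88 codes of 9 bits, and 212/256 + 88/512 = 1, so the codes form a complete prefix code.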

This version scales up to 32 bits, meaning basically no restriction.

Licensed under: CC-BY-SA with attribution
