Question

This is a question I ran into in school settings, but it keeps bugging me so I decided to ask it here.

In Huffman compression, fixed-length sequences (characters) are encoded with variable-length sequences. The code sequence length depends on the frequencies (or probabilities) of the source characters.

My questions is: what is the minimum highest character frequency, with which that character will be encoded by a single bit?

Was it helpful?

Solution

It turns out that the answer is 0.4, that is, if the highest frequency p is p >= 0.4, 1-bit code for the corresponding character is guaranteed. In other words, this is a sufficient condition.

It is also true that p >= 1/3 is a necessary condition. That is, there may be examples where 0.4 > p >= 1/3, and the shortest code is 1-bit, but there are no cases like that if p < 1/3.

The way to reason about that is to look at the way the code tree is constructed, in particular at the frequencies of the 3 last surviving subtrees. A proof appears in Johnsen, "On the redundancy of binary Huffman codes", 1980 (unfortunately this is a paid link).

OTHER TIPS

In general, about 50% of the incoming symbol stream would have to consist of a given symbol for Huffman to code it as a single bit. The reason for this is that due to how Huffman coding works (the one symbol's encoding cannot be the prefix of another), by encoding a symbol with a single bit, you are requiring that the first bit for every other symbol be the opposite value (i.e. if one symbol is encoded as 0, everything else must start with 1 plus at least one more bit). Since you're eliminating half of the possible encoding space for any given bit length, you need to be gaining a way to encode at least half of the symbols being input in order to break even.

Note that there is a special case where the symbol space only consists of 3 symbols. In such a case, whichever symbol has the greatest frequency will be encoded with 1 bit (since the other two will be the 2nd-bit variations of whichever first-bit value isn't chosen) - if 2 or more have equally greater probability, either one could be encoded. Thus, in the 3-symbol case it's possible that a symbol with, say, 34% probability could theoretically be coded as a single bit (say, 0) while the other two might have 33% or lower probabilities and be coded as 10 and 11.

So if you're considering all possibilities, then technically anything 1/3 or above could potentially be encoded as a single bit (in the 3-symbols case).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top