Question

What is the difference between the average length of codes and the average length of codewords in the Huffman algorithm? Do they both mean the same thing? I am stuck on some facts:

I saw a fact that is marked as false:

For a text whose characters are drawn from an alphabet of n symbols, the average length of the codes is O(log n).

I ran into a bigger challenge when I saw these two facts:

The average codeword length in Huffman's algorithm is Ω(log n).

The average codeword length in Huffman's algorithm is O(log n).

I need a very small example to clarify the difference between these concepts.

Update:

This is the explanation of why that fact is false (i.e., why "the average length of codes is O(log n)" is false):

[image omitted]


Solution

What is the difference between the average length of codes and the average length of codewords?

By definition, an alphabet is a set of things {A1, …, AN} that we might wish to encode.

A code C is a mapping from an alphabet {A1, …, AN} to a set of finite-length binary strings. C(Aj) is called the codeword for symbol Aj.

The length λj of the codeword C(Aj) is the number of bits in this codeword.

Assume a code C over an alphabet of N symbols with probabilities p(Ai), and let λi be the length of the codeword C(Ai). Then the average length of the code C is the sum of p(Ai)·λi over all i. As a result, in general the average code length is bounded below by the shortest codeword length and bounded above by the longest codeword length.
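
As a minimal sketch of this definition in Python (the code C and the probabilities below are made up purely for illustration):

    # A toy code C over the alphabet {A1, A2, A3} (made-up example values).
    C = {"A1": "0", "A2": "10", "A3": "11"}
    p = {"A1": 0.5, "A2": 0.25, "A3": 0.25}

    # Average length of C: the sum of p(Ai) * lambda_i over all i.
    avg_length = sum(p[a] * len(C[a]) for a in C)
    print(avg_length)  # 0.5*1 + 0.25*2 + 0.25*2 = 1.5

    # The average lies between the shortest and the longest codeword length.
    lengths = [len(w) for w in C.values()]
    assert min(lengths) <= avg_length <= max(lengths)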

And this definition looks exactly the same as the average length of codewords.

For a text whose characters are drawn from an alphabet of n symbols, the average length of the codes is O(log n).

So by this definition, the average length of an optimal code generated by the Huffman algorithm is always at most ⌈log n⌉ (an upper bound given by the smallest integer that is at least log n, as discussed here). More often, entropy is used to lower-bound the average code length (ref).
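
As a concrete check, here is a self-contained Python sketch (the five symbol probabilities and the huffman_lengths helper are my own, chosen for the example) that computes Huffman codeword lengths with a heap and verifies that entropy ≤ average length ≤ ⌈log n⌉ for this input:

    import heapq
    import math
    from itertools import count

    def huffman_lengths(probs):
        """Codeword length of each symbol in an optimal (Huffman) code,
        returned in the same order as the input probabilities."""
        n = len(probs)
        if n == 1:
            return [1]  # a single symbol still needs one bit
        # Heap entries: (probability, tiebreaker, symbol indices in the subtree).
        tiebreak = count()
        heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        lengths = [0] * n
        while len(heap) > 1:
            p1, _, s1 = heapq.heappop(heap)
            p2, _, s2 = heapq.heappop(heap)
            # Merging two subtrees pushes every symbol in them one level deeper.
            for i in s1 + s2:
                lengths[i] += 1
            heapq.heappush(heap, (p1 + p2, next(tiebreak), s1 + s2))
        return lengths

    # Hypothetical example distribution over n = 5 symbols.
    probs = [0.4, 0.2, 0.2, 0.1, 0.1]
    lengths = huffman_lengths(probs)
    avg = sum(p * l for p, l in zip(probs, lengths))
    entropy = -sum(p * math.log2(p) for p in probs)
    print("lengths:", lengths)                                 # [2, 2, 2, 3, 3]
    print("average:", avg)                                     # 2.2
    print("entropy:", round(entropy, 3))                       # ~2.122
    print("ceil(log2 n):", math.ceil(math.log2(len(probs))))   # 3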

However, from the explanation in your update, the context seems different: the "average" in your question is the sum of all codeword lengths divided by the total number of codewords, i.e., an unweighted average. That quantity is Ω(log n) instead (a proof, FYI), meaning the best-case average length is log n (not the worst case, as O(log n) would indicate). And it is consistent with your example of 1-bit, 2-bit, …, (2n−1)-bit strings.
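
To see the two averages diverge, consider a deliberately skewed dyadic distribution (my own choice for illustration): p_i = 2^(−i) for i = 1..n−1, with the last probability repeated so they sum to 1. Huffman's tree on this input degenerates into a path, so the codeword lengths 1, 2, …, n−1, n−1 can be written down directly rather than computed:

    import math

    def skewed_example(n):
        """Dyadic distribution whose Huffman tree is a path with
        codeword lengths 1, 2, ..., n-1, n-1."""
        probs = [2.0 ** -i for i in range(1, n)] + [2.0 ** -(n - 1)]
        lengths = list(range(1, n)) + [n - 1]
        return probs, lengths

    for n in (8, 16, 32, 64):
        probs, lengths = skewed_example(n)
        weighted = sum(p * l for p, l in zip(probs, lengths))  # probability-weighted
        unweighted = sum(lengths) / len(lengths)               # plain mean of lengths
        print(n, round(weighted, 3), round(unweighted, 3), round(math.log2(n), 3))

The weighted average here stays below 2 bits for every n, while the unweighted average grows like n/2, far above log2(n). Conversely, by Kraft's inequality and convexity, the unweighted average of any binary prefix code on n symbols cannot drop below log2(n), which is where the Ω(log n) bound comes from.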

Other tips

The first statement doesn’t make any sense to me.

For the second, assume an alphabet with 2^32 elements, where one element occurs with probability 1/2 and each of the others with probability 0.5 / (2^32 − 1).
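
Running Huffman on 2^32 symbols is impractical, but a scaled-down version of this hint (2^12 elements with the same shape of distribution) can be checked directly; the huffman_lengths helper repeats the sketch from the solution above:

    import heapq
    import math
    from itertools import count

    def huffman_lengths(probs):
        """Codeword length of each symbol in an optimal (Huffman) code."""
        tiebreak = count()
        heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        lengths = [0] * len(probs)
        while len(heap) > 1:
            p1, _, s1 = heapq.heappop(heap)
            p2, _, s2 = heapq.heappop(heap)
            for i in s1 + s2:
                lengths[i] += 1  # merged symbols move one level deeper
            heapq.heappush(heap, (p1 + p2, next(tiebreak), s1 + s2))
        return lengths

    k = 12                       # scaled down from 32 so the experiment runs quickly
    n = 2 ** k
    probs = [0.5] + [0.5 / (n - 1)] * (n - 1)
    lengths = huffman_lengths(probs)
    avg = sum(p * l for p, l in zip(probs, lengths))
    print("length of the frequent symbol:", lengths[0])  # 1 bit
    print("weighted average length:", round(avg, 3))     # about (k + 2) / 2 = 7
    print("log2(n):", k)

The frequent element gets a 1-bit codeword while the rest get about k + 1 bits, so the weighted average comes out near (k + 2) / 2. Extrapolating the same pattern to k = 32 suggests an average of roughly 0.5·1 + 0.5·33 ≈ 17 bits, compared with log2(n) = 32.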

Licensed under: CC-BY-SA with attribution