Data compression - Entropy
29-09-2020
Question
Let's say I have an alphabet $$\Sigma = \{A, B, C, D, E\}$$
with probabilities $$P(A) = P(B) = P(C) = 0.25 \text{ and } P(D)=P(E) = 0.125.$$
I know that the entropy then is: $$H(\Sigma) = 3 \cdot 0.25 \cdot \log_2 4 + 2 \cdot 0.125 \cdot \log_2 8 = 2.25.$$
My question now is: What does this mean in relation to the lower limit of compression? How many bits will I at least need to compress a text that consists of the above alphabet?
Solution
The idea is to encode the more frequent symbols with fewer bits than the less frequent ones.

In your example, this means we can compress more by encoding A, B, C with fewer bits than D, E, rather than giving every symbol a code of equal length. With Huffman coding, A, B, and C are each represented by 2 bits (that is, log2 4), while D and E each take 3 bits (log2 8). This makes the expected code length minimal: 2.25 bits per symbol, so about 2.25 × (text length) bits in total, because you expect 0.25 of the characters in your file to be A (likewise B and C) and 0.125 to be D (likewise E).
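To make this concrete, here is a minimal Python sketch (not from the original post) that builds a Huffman code for the given probabilities and checks that the expected code length matches the entropy. The `huffman_codes` helper is an illustrative name, not a library function.

```python
import heapq
from math import log2

# Symbol probabilities from the question.
probs = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.125, "E": 0.125}

# Shannon entropy: the lower bound on expected bits per symbol.
entropy = -sum(p * log2(p) for p in probs.values())

def huffman_codes(probs):
    """Build a Huffman code; returns {symbol: bitstring}."""
    # Heap entries: (probability, tiebreak id, {symbol: code_so_far}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; prepend a bit to each side.
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes(probs)
expected_len = sum(probs[s] * len(c) for s, c in codes.items())
print(entropy)       # 2.25
print(expected_len)  # 2.25
```

Because these probabilities are all powers of 1/2 (a dyadic distribution), the Huffman code achieves the entropy exactly; for general probabilities the expected length lies between $H(\Sigma)$ and $H(\Sigma) + 1$ bits per symbol.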
I hope this makes it clear...