Question

I am a little confused about how they calculate the "average number of bits per symbol". Is this calculated by taking the probability of each character and multiplying it by lg(1/probability), as in the regular entropy formula, or some other way?

Also, if this is true, how do they know for sure what the average occurrence of a letter is?


Solution

I really shouldn't answer this because I don't know much about compression, but I can say:

  • How is "bits per symbol" defined?

You are correct: it's the regular Shannon entropy, defined as -Σ p·log₂(p). Note, however, that the probabilities here are frequencies of whole messages, not of individual characters. For example, the following set of messages

{ abcdefghijklmnopqrstuvwxyz }

looks great when analyzed letter-by-letter (every letter is equally likely), but as a message set it has an entropy of 0, because it contains only one message.
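A minimal sketch of that contrast, using the standard entropy formula (the helper name `entropy` is my own, not from the question):

```python
from collections import Counter
from math import log2

def entropy(probs):
    # Shannon entropy in bits: -sum p * log2(p), skipping zero terms
    return -sum(p * log2(p) for p in probs if p > 0)

msg = "abcdefghijklmnopqrstuvwxyz"

# Letter-by-letter: every letter appears once, so each p = 1/26
letter_probs = [n / len(msg) for n in Counter(msg).values()]
per_letter = entropy(letter_probs)   # log2(26) ≈ 4.70 bits per symbol

# Message-level: the set contains only this one message,
# so p(message) = 1 and the entropy is 0 bits.
per_message = entropy([1.0])
```

So a coder that works per letter sees maximum disorder, while one that works per message sees none at all.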

  • How can you know what the average occurrence of a letter is?

It is theoretically impossible to know for sure, unless you know the exact process by which the messages are generated. You have to use some heuristic: take a large sample and count the occurrences, or look for patterns that you know are signs of redundancy, such as the structure of English text.
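The "take a large sample and count" heuristic can be sketched like this (the function name `estimate_probs` and the sample string are illustrative, not from the answer):

```python
from collections import Counter

def estimate_probs(sample):
    # Empirical probability of each symbol: count / total length
    counts = Counter(sample)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}

probs = estimate_probs("the quick brown fox jumps over the lazy dog")
```

The bigger and more representative the sample, the closer these empirical frequencies get to the true source probabilities, but they remain an estimate, never a guarantee.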

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow