Basic Arithmetic Coding

https://stackoverflow.com/questions/21719014

10-10-2022
|

Question

I have to explain to my class how to do basic arithmetic coding on a small message. I've been investigating lots of documents and reading a lot and I can say I theoretically understand how this method works, but still have some questions.

I'm stepping through these examples (first example, second page) - we have 'eaii!' message, and we want to code it using the arithmetic method.

In the example, it sets

Symbol      Probability      Range
a             .2            [0 , 0.2)
e             .3            [0.2 , 0.5)
i             .1            [0.5 , 0.6)
o             .2            [0.6 , 0.8)
u             .1            [0.8 , 0.9)
!             .1            [0.9 , 1.0)

My first question is, how did it set probabilities?, my logic tells me that if i have two 'i' symbols then that symbol should have the highest probability, shouldn't it?

Also how did it determined which range to start from and other ranges afterwards??

Another example was coding the message 'abc', which was set like this:

Symbol      Probability      Range
a             .7            [0 , 0.7)
b             .1            [0.7 , 0.8)
c             .2            [0.8 , 1.0)

I also don't understand why the first symbol has substantially greater probability than the others and even if it was an order of appearance thing, I don't understand how it set it to 0.7, like why not 0.8 or 0.5.

I hope I made myself clear and I'd appreciate any kind of help.

Solution

They are imagining a fixed model for the data that was established long before that specific message is to be encoded. The model was in principle constructed from a large ensemble of such messages, so there is no reason to believe that eaii! by itself should match the probabilities in the model. Of course, the model is just for illustration purposes, and no more real than the eaii! message. (Though I think I said exactly that the other day when I was pulling something out of the oven.)

The order of the symbols in the model is arbitrary. It just needs to be the same model on both ends. It is of course important that the probabilities add up to one.

The second model is simply another arbitrary model to illustrate how a symbol can be coded in less than a bit, when it has a probability greater than 1/2. For that model, each a in a series of a's would take a little over half a bit.

OTHER TIPS

I think the probability is for the sample.

To determine which probability to test for a specific sample, just count each character, and divide, for each, its number of occurrences by the total number of characters. (sum is 1).

Note that the arithmetic algorithm is efficient when there is a huge repetition of the same character.

For example :

aaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaabaaaaa will compress very good abfpababbahnajdapkalamkmdamlkapaaapokpokpdq will not compress very good (try Huffman)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow