Arithmetic coding, the termination symbol, and the empty string

https://stackoverflow.com//questions/9696110

lossless-compression

13-12-2019
|

Question

Suppose the source alphabet is a, b, c with a as the termination symbol and so the unit interval is correspondingly divided as [0, P(a), P(a)+P(b), 1].

Strings consisting of a bunch of b's and c's ending with an a (the termination symbol) are valid for encoding. Strings with an a in middle are considered invalid for encoding.

So its easy to construct strings with encodings lying in the interval [P(a), 1). But does arithmetic coding assign any string an encoding in the interval [0, P(a))? Would the empty string qualify as being encoded to a bitstring lying in [0, P(a))? Since the empty string can be thought of as the string "a" or as just the termination symbol.

Since devoting space to encoding the empty string would seem pointless why not have the first division of the unit interval be [0, (P(b)-P(a))/(1-P(a)), 1] which corresponds to mapping [P(a), P(a)+P(b), 1] to fill up the unit interval. Then subsequent refining divisions would use [0, P(a), P(a)+P(b), 1] as usual.

Solution

Yes, the empty string would be in that interval (ie, 0). This is redundant in that you can also infer that the string is zero-length from the length of the encoded representation, so you could exclude it. More generally, if you can infer that any symbol is impossible, based on previous portions of the string, then you can exclude it (giving the other symbols more of the interval) and save a little space. But if the only case in which you do this is with the first symbol, then the space saving is likely to be too negligible to justify the complexity of an extra special case.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow