Question

I'm looking to encode user_ids in a long list of call records. The part of these records that takes up the most space is the symbols for the caller and receiver. I will create a map that assigns shorter symbols to the most active callers; this will help keep the overall size of the files (and therefore the I/O time) down.
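
For concreteness, here is the kind of mapping I have in mind, as a rough Python sketch with made-up IDs (the real records are much longer):

    from collections import Counter

    # Hypothetical call records: (caller_id, receiver_id) pairs.
    records = [("alice", "bob"), ("alice", "carol"), ("bob", "alice"), ("alice", "bob")]

    # Count how often each user_id appears, then hand out symbols by rank:
    # the most frequent IDs get the lowest (and therefore shortest) codes.
    counts = Counter(uid for record in records for uid in record)
    symbol_for = {uid: rank for rank, (uid, _) in enumerate(counts.most_common())}
    print(symbol_for)  # e.g. {'alice': 0, 'bob': 1, 'carol': 2}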

I know in advance how many times each symbol will be used; in other words, I know the relative probability distribution. Furthermore, it is not important that the resulting codes be "prefix free" like Huffman codes. So what's the best encoding scheme, i.e., the one that will deliver the most compression and for which a quick implementation exists?

An answer should not only point to a compression scheme but also to an implementation of that scheme.


Solution

For general-purpose lossless encoding with a known probability distribution, aside from Huffman coding, the other "textbook" answer is arithmetic coding.

In practice, there are a variety of implementations; see these general-purpose coders. Each has different properties (speed, memory footprint, whether the model is static or adaptive), so without further information about your constraints we can't give you a more precise answer.
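
For intuition, here is a minimal sketch of the core idea of arithmetic coding in Python, using exact fractions so the arithmetic is easy to follow. A production coder works with fixed-width integers and renormalization instead, so treat this only as an illustration; the symbols and probabilities are made up.

    from fractions import Fraction

    # Made-up symbol probabilities; in the question's setting these would come
    # from the known call-frequency distribution.
    probs = {"A": Fraction(1, 2), "B": Fraction(1, 3), "C": Fraction(1, 6)}

    # Assign each symbol a half-open sub-interval of [0, 1) proportional to its probability.
    intervals = {}
    low = Fraction(0)
    for sym, p in probs.items():
        intervals[sym] = (low, low + p)
        low += p

    def encode(message):
        # Repeatedly narrow [low, high); any number in the final interval encodes the message.
        low, high = Fraction(0), Fraction(1)
        for sym in message:
            span = high - low
            s_low, s_high = intervals[sym]
            low, high = low + span * s_low, low + span * s_high
        return low  # the interval's lower bound is a valid code

    def decode(value, length):
        # Reverse the narrowing: at each step, find which symbol's interval contains the value.
        out = []
        low, high = Fraction(0), Fraction(1)
        for _ in range(length):
            span = high - low
            scaled = (value - low) / span
            for sym, (s_low, s_high) in intervals.items():
                if s_low <= scaled < s_high:
                    out.append(sym)
                    low, high = low + span * s_low, low + span * s_high
                    break
        return "".join(out)

    msg = "ABACAB"
    print(decode(encode(msg), len(msg)) == msg)  # True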

OTHER TIPS

@conradlee: re "In what cases is arithmetic coding better than Huffman coding?" In terms of compression, nearly always. If you have a symbol S with probability p(S), then the ideal number of bits to code it with is b(S) = -log(p(S))/log(2), i.e. -log2(p(S)). For example, if p(S) is 1/3 then b(S) is about 1.585 bits. With Huffman you have to round up or down to a whole number of bits per symbol, so the compression ratio decreases; arithmetic coding effectively stores each symbol with a fractional number of bits.
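
To see the rounding loss concretely, here is a small Python sketch (the frequencies are hypothetical) that compares each symbol's ideal length -log2(p) with the whole-bit lengths a Huffman code actually assigns:

    import heapq
    from math import log2

    # Hypothetical symbol frequencies, not taken from the question's data.
    freqs = {"A": 5, "B": 2, "C": 1, "D": 1}
    total = sum(freqs.values())

    # Ideal code length for each symbol: -log2(p) bits.
    ideal = {s: -log2(f / total) for s, f in freqs.items()}

    # Build a Huffman code with a min-heap; the integer counter breaks ties
    # so tuples never compare their dict element.
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    codes = heap[0][2]

    entropy = sum(freqs[s] / total * ideal[s] for s in freqs)
    huffman_avg = sum(freqs[s] / total * len(codes[s]) for s in freqs)
    for s in sorted(freqs):
        print(f"{s}: p={freqs[s]/total:.3f}  ideal={ideal[s]:.3f} bits  huffman={len(codes[s])} bits")
    print(f"entropy (arithmetic coding approaches this): {entropy:.3f} bits/symbol")
    print(f"huffman average: {huffman_avg:.3f} bits/symbol")

The per-symbol gap between the ideal and the Huffman length is the rounding loss; an arithmetic coder's output size approaches the entropy figure instead of the Huffman average.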
