Question

I have 64 bit values that I want to compress by exploiting the fact that only a portion somewhere in the middle contains data and before and after that are zeroes.

Say the actual data is l bits long and padded with n 0s in front and m 0s at the end such that n + l + m = 64. Instead of transmitting / storing 64 bits, I can transmit l bits plus whatever I need to encode the position of the data in the 64-bit interval.

For example, say I was storing l, m and the data bits, then I would restore the original 64-bit pattern by reading l, reading l bits of data, reading m and shifting the data m bits to the left.
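That restore step might look like this in Python (a sketch; `read_bits` is a hypothetical helper that returns the next k bits of the stream as an integer):

```python
def restore(read_bits):
    # read_bits(k) is assumed to return the next k bits as an int
    l = read_bits(6)      # length of the data run
    data = read_bits(l)   # the data bits themselves
    m = read_bits(6)      # trailing-zero count
    return data << m      # reassemble the original 64-bit value
```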

The smallest overhead I could come up with is two times 6 bits for storing any two of l, n and m (each can be between 0 and 64). Is it possible to reduce that number?


Solution

l can be from 0 to 64, so don't send l; send n and m instead, since they can both be zero and neither needs to go all the way up to 64 (they only need to be able to sum to 64).

The l data bits must start and end with a 1, so those two bits do not need to be transmitted.

send 6 bits for n
send up to 6 bits for m (see below)
calculate l = 64 - (n + m)
if l = 0, the number is 0, don't send anything else
if l = 1, the number is 1 * 2^m, don't send anything else
if l = 2, the number is 3 * 2^m, don't send anything else
send the middle l - 2 bits.

Maximum overhead = 10 bits.

The reduction in the bits for m is possible because:

if n > 32 then you know m < 32, so m needs only 5 bits
if n > 48 then you know m < 16, so m needs only 4 bits
if n > 56 then you know m < 8, so m needs only 3 bits
if n > 60 then you know m < 4, so m needs only 2 bits
if n = 63 then you know m < 2, so m needs only 1 bit
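Putting the steps together, here is a sketch in Python (the bit-string representation and the convention that zero is encoded as n = 63, m = 1 are my assumptions, not spelled out above):

```python
def width_m(n):
    # Bits needed for m once n is known: m can be at most 64 - n,
    # which reproduces the 6/5/4/3/2/1-bit table above.
    return min(6, (64 - n).bit_length())

def encode(x):
    """Encode a 64-bit value as a string of '0'/'1' characters."""
    assert 0 <= x < 2 ** 64
    if x == 0:
        n, m = 63, 1                   # convention: l = 64 - 63 - 1 = 0
    else:
        m = (x & -x).bit_length() - 1  # trailing zeros
        n = 64 - x.bit_length()        # leading zeros
    l = 64 - n - m
    bits = format(n, '06b') + format(m, '0%db' % width_m(n))
    if l > 2:
        # The first and last of the l data bits are known to be 1,
        # so only the middle l - 2 bits are sent.
        middle = (x >> (m + 1)) & ((1 << (l - 2)) - 1)
        bits += format(middle, '0%db' % (l - 2))
    return bits

def decode(bits):
    n = int(bits[:6], 2)
    w = width_m(n)
    m = int(bits[6:6 + w], 2)
    l = 64 - n - m
    if l == 0:
        return 0
    if l == 1:
        return 1 << m
    if l == 2:
        return 3 << m
    middle = int(bits[6 + w:6 + w + (l - 2)], 2)
    return ((1 << (l - 1)) | (middle << 1) | 1) << m
```

For the all-ones value this produces 6 + 6 + 62 = 74 bits, i.e. the stated maximum overhead of 10 bits over the raw 64.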

OTHER TIPS

Your analysis sounds right for single values. But if you're transmitting lots of such values together, a general-purpose compression algorithm like gzip will probably do better, since it can eliminate the strings of zeroes quite well and also exploit redundancies in the data.

As you have stated the problem, no, you cannot do better than the solution you have proposed.

However, if the distribution of the zeros in the numbers is skewed, you may be able to get better compression on average by using Huffman codes or a similar technique to represent the counts. Another possibility is to use delta coding if the zero distribution is strongly correlated from one 64-bit value to the next.

In either case, you will need to use a variable number of bits to represent the numbers of zeros. And if your assumptions about skewedness or correlation turn out to be false, you may end up using more bits on average than if you had done it the simple way.
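To illustrate the delta-coding idea (a sketch; the function names are mine): instead of storing each value's zero count outright, store how it changed from the previous value. When consecutive values are strongly correlated, the deltas cluster near zero and can be given short variable-length codes.

```python
def delta_encode(counts):
    # Send the first count as-is, then only the change from one
    # value to the next.
    return counts[:1] + [b - a for a, b in zip(counts, counts[1:])]

def delta_decode(deltas):
    counts = deltas[:1]
    for d in deltas[1:]:
        counts.append(counts[-1] + d)
    return counts
```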

Your solution seems pretty good.
Huffman coding is another way to compress your values, especially if some values occur with high frequency.

It's not very difficult to implement, but it might be overkill if you don't have much data to transmit.

There are 64 possible start positions n for the sequence of ones, and the length l of the sequence can be no longer than 64 - n. So there are

r = sum(n = 0..63, 64 - n) + 1

sequences in total. The added one is for a sequence of all zeros. Doing some math yields the following.

r = 64 * 64 - (63 * 64) / 2 + 1
  = 2081

Representing 2081 possible values requires log2(2081) ≈ 11.023 bits. Your suggestion to encode the information using two 6-bit numbers, 12 bits in total, is hence optimal for a fixed-width encoding (under the assumption that all possible values are equally likely).
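The count can be checked directly in Python:

```python
import math

# One sequence per (start position n, length l) pair with l >= 1,
# plus one extra code for the all-zero word.
r = sum(64 - n for n in range(64)) + 1
print(r, math.log2(r))  # 2081, about 11.023
```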

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow