Question

I want the best compression algorithm for a list of random numbers.

List example:

224.19
225.57
226.09
222.74
222.20
222.11
223.14
540.56
538.96
540.14
540.44
336.45
338.47
340.78
156.73
160.02
158.56
156.23
55.08
56.33
54.88
53.45

I can skip the fractional part. I have a huge list of numbers just like the example above, which is why it needs to be compressed.

Can you recommend something?

Solution 2

As noted in the comments, your numbers are far from random.

I would first remove the decimal point since it appears that all of your numbers can be described with two digits after the decimal point. So simply multiply all numbers by 100 when compressing, and divide by 100 when decompressing.
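As a rough sketch of that scaling step (the names `to_fixed` and `from_fixed` are made up for illustration, and it assumes the values really do have at most two decimal places):

```python
def to_fixed(values, scale=100):
    """Convert decimal values to integers by scaling (hundredths by default)."""
    # round() rather than int(): a product such as 0.29 * 100 is stored as
    # 28.999999999999996 in binary floating point, and int() would truncate
    # it to 28 instead of 29.
    return [round(v * scale) for v in values]


def from_fixed(ints, scale=100):
    """Undo to_fixed() by dividing by the same scale."""
    return [i / scale for i in ints]


prices = [224.19, 225.57, 226.09, 222.74]
fixed = to_fixed(prices)
print(fixed)              # [22419, 22557, 22609, 22274]
print(from_fixed(fixed))  # [224.19, 225.57, 226.09, 222.74]
```

If you drop the fractional part, as the question allows, the scaling step disappears and you just round each value to the nearest integer.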

Second, I would delta-code the numbers by subtracting the previous number from each. The first number is unchanged. To reconstruct, keep a running sum of the deltas. So then you end up with:

22419, 138, 52, -335, -54, -9, 103, 31742, -160, 118, 30, -20399,
202, 231, -18405, 329, -146, -233, -10115, 125, -145, -143

to code. Now we're getting somewhere. The deltas are typically small, with an occasional big jump, so use variable-length integers to code them. A histogram of the deltas would be useful for constructing that code well. A simple example would be 7 bits per byte, with the high bit set to one to mark the last byte of each integer. Since the deltas can be negative, the sign has to be encoded too, for example by mapping signed values to unsigned ones first. A more complex scheme at the bit level might compress better, depending on the probability distribution.
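Here is a minimal sketch of the whole pipeline, to make the idea concrete. The function names are my own, and the zigzag mapping for negative deltas is an assumption on my part (the scheme described above does not say how to handle signs); the byte layout follows the 7-bits-per-byte description, with the high bit set on the last byte of each integer.

```python
def delta_encode(values):
    """Replace each value by its difference from the previous one."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)  # first delta is the first value itself
        prev = v
    return out


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def zigzag(n):
    """Map 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ... (assumed sign handling)."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z):
    """Inverse of zigzag()."""
    return z >> 1 if z % 2 == 0 else -((z + 1) >> 1)


def varint_encode(numbers):
    """Pack non-negative integers, 7 bits per byte, low-order bits first."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append(n & 0x7F)  # high bit 0: more bytes follow
            n >>= 7
        out.append(n | 0x80)      # high bit 1: last byte of this integer
    return bytes(out)


def varint_decode(data):
    """Unpack the byte stream produced by varint_encode()."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:           # end-of-integer marker
            numbers.append(current)
            current, shift = 0, 0
        else:
            shift += 7
    return numbers


# The example list from the question, already multiplied by 100.
fixed = [22419, 22557, 22609, 22274, 22220, 22211, 22314, 54056,
         53896, 54014, 54044, 33645, 33847, 34078, 15673, 16002,
         15856, 15623, 5508, 5633, 5488, 5345]

deltas = delta_encode(fixed)
packed = varint_encode(zigzag(d) for d in deltas)
restored = delta_decode([unzigzag(z) for z in varint_decode(packed)])

print(len(packed), "bytes for", len(fixed), "values")
print(restored == fixed)
```

On this sample the packed stream comes to 45 bytes for 22 values, about two bytes per number, before any bit-level tuning.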

OTHER TIPS

Don't use floats. Use integers, with some sort of control character to represent the decimal point if you need it; if you can skip it, all the better.

Take a look at variable-byte encodings. Their advantage is that you don't need to allocate 64 bits of memory for small integers.

If your numbers depend on each other, you could look into delta encoding: it stores the difference between consecutive numbers rather than the numbers themselves.
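One quick way to check whether delta encoding is worth it for your data is to compare how many bits the raw values need with how many the differences need. This is only a sketch; `deltas` and `bit_length_histogram` are hypothetical helper names, and the sample is the start of the question's list multiplied by 100.

```python
from collections import Counter


def deltas(values):
    """First value unchanged, then differences between consecutive values."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def bit_length_histogram(numbers):
    """Count how many numbers need each bit length (sign ignored)."""
    return Counter(abs(n).bit_length() for n in numbers)


sample = [22419, 22557, 22609, 22274, 22220, 22211, 22314, 54056]

raw_hist = bit_length_histogram(sample)
delta_hist = bit_length_histogram(deltas(sample))

print("raw   :", sorted(raw_hist.items()))
print("deltas:", sorted(delta_hist.items()))
```

If most of the deltas need far fewer bits than the raw values, as they do here, the numbers are correlated enough for delta encoding to pay off.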

Variable-byte encoding and delta encoding are core methods for compressing the posting lists of inverted indexes at Google and other companies that build search engines.

Licensed under: CC-BY-SA with attribution