Question

I need to store a large number of integers in a file. The order of the integers does not matter, so the total information content should be lower than that of an ordered list. Is there a more space-efficient way to store the numbers than as an arbitrarily ordered array?

Edit: I assume the integers to be completely random. I am really looking for a universal way to squeeze out the redundant information which is introduced by fixing a permutation.

Was it helpful?

Solution

To compress, you actually need higher information content. You cannot in general compress randomness. Lucky for you, your problem specification allows you to reorder the data. Therefore, you may sort the data, thereby increasing its information content. Then, instead of storing the list of integers, you need only store the smallest and the sequence of first differences. The first differences will be smaller than the numbers themselves, so should fit into fewer bits.

The sorted randomly generated sequence

sorted seq     (173 218 257 490 618 638 715 815 856 929 932 996)
number of bits (  6   6   6   7   7   7   7   7   7   7   7   7)

can be stored as

first diff     (173 45 39 233 128 20 77 100 41 73 3 64)
number of bits (  6  4  4   6   5  3  5   5  4  5 2  5)

Where e.g. 45 is the difference between 173 and 218, the first and second elements. These numbers require 54 bits versus 81 above. If the numbers are fairly dense in the range from which they were drawn, you may see the maximum bits for the first difference be lower than the data enabling you to use a smaller fixed bit-length. If you do not use fixed size, you must also store delimiters or use some other adaptive scheme so you can determine where one number leaves off and the next begins. If your data has a large number of duplicates, as would occur if your numbers are drawn randomly from a relatively small range, you might also look into run length encoding the zeros in the first differences.

OTHER TIPS

In general I would say no. If your numbers have some pattern or are distributed in some singular way then you should mention it.

This paper does exactly this.

Compressing Multisets with Large Alphabets

Paper: https://arxiv.org/abs/2107.09202

Code: https://github.com/facebookresearch/multiset-compression

Summary: https://twitter.com/_dsevero/status/1419661190750425102

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top