Question

I have a sequence of Integers in the following format:

Integer1 Integer2 Integer3 Integer4 Integer5 ....

Each four consective integers corresponds to values of a single record. So, I cannot really order them.

What would be the best way to compress such file?

Updates:

1- The values are indpendent of each other. Each 4 consective integers represents a record, for example:

CustomerId PurchaseId Products MoneySpent

Each hold an integer value.

2- Ideally I would like to have it compressed as an object and on disk.

Thanks

Was it helpful?

Solution

The simplest and most compatible approach is to GZIP the file as you write it by wrapping your stream with GZIPOutputStream and reading it wrapped with GZIPInputStream.

InputStream in = new BufferedInputStream(new GZIPInputStream(new FileInputStream(filename)));

OutputStream out = new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(filename)));

OTHER TIPS

Using GZip is not optimal in the given way. Since your OrderID, your PurcaseId, ProductID and MoneySpent are different to each other but all OrderIds have something in common as have PurcaseId, ProductId and MoneySpent. So it is best to store those values not row wise but column wise.

Since you usually have a sort order within this table you are about to store, one column could expressed by delta value. For example if you sort your values by OrderId you can express the sequence of 10, 23, 44, 53 as +10, +13, +21, +53. These numbers are smaller and more prone to repeat than the original number.

Integer values can be expressed as variable bit length information. First you store the number of bits of the value and than the actual value. This way you save a lot of leading zeros.

For money spent you can also think about the actual repetition of typical numbers like 99, 25, 50, 49 and so on for the cent values. It is more likely that a product has the price of 49,99 but not 51,23. So spliting the money integer into two values will give you the ability to use Huffman encoding and treat special values as symbols and the rest as runlength bits.

To express the bit length, you can also use different encoding schemes one would be yet again a huffman code of 64 symbols (64 different length information) and train a coding schema. This way you will end up with very less numbers of bits instead of writing integers or even longs.

The remaining stuff can be put into gzip. This works usually better depending on the way you express the bit length since it is easier to compress leading zeros than different bit length information but every compression cost.

Another coding scheme for bit lengths is using the min max approach.

For example for the above sequence 10, 23, 44, 53 we store 10, +43 (53), +13, +23. The idea is to know that between 10 and 53 there are 43 elements. So the next value has a maximum length of 6 (2^6 = 64) bits. This way there is no need for bit length information. You just store the sequence in the oder first minimum, next maximum, next minimum, next maximum and so on.

A more efficient scheme is using minimum, maximum, middle, middle left, middle right, middle left left, middle left right, middle right left, middle right right ... . This way you have the best chance to result in smallest bit length knowledge. Using this way results in very small sizes of the integers without additional bit length information.

Using such schemes often leaves GZip a chance of further reduction by < 10% resulting in the omitting of GZip at all.

[Summary]

So GZip is simple, if you need to squeeze out more, go for column wise instead of row/entry wise. Use special knowledge of each column. If sorted use deltas as representation. Use bit lengths informations being expressed by huffman codes (one for each column) and using values for cents and dollars for product prices often result in very good compression chances. Store sorted columns by deltas and use the tree wise storage resulting in very good knowledge about the bit length to expect next.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top