Compression performance related to chunk size in hdf5 files

https://stackoverflow.com/questions/16786428

30-05-2022

Question

I have a question about how compression performance relates to chunk size in HDF5 files.

I have two HDF5 files with the following properties. Each contains a single dataset, called "data".

File A's "data":

Type: HDF5 Scalar Dataset
No. of Dimensions: 2
Dimension Size: 5094125 x 6
Max. dimension size: Unlimited x Unlimited
Data type: 64-bit floating point
Chunking: 10000 x 6
Compression: GZIP level = 7

File B's "data":

Type: HDF5 Scalar Dataset
No. of Dimensions: 2
Dimension Size: 6720 x 1000
Max. dimension size: Unlimited x Unlimited
Data type: 64-bit floating point
Chunking: 6000 x 1
Compression: GZIP level = 7

File A's size: HDF5: 19 MB, CSV: 165 MB

File B's size: HDF5: 60 MB, CSV: 165 MB

Both files compress well compared with their CSV equivalents. However, file A shrinks to about 10% of the original CSV size, while file B only shrinks to about 30%.

I have tried different chunk sizes to make file B as small as possible, but about 30% seems to be the best compression ratio I can get. Why can file A achieve greater compression while file B cannot?

If file B can achieve the same ratio, what should its chunk size be?

Is there any rule for determining the optimum HDF5 chunk size for compression purposes?

Thanks!

Solution

Chunking doesn't really affect the compression ratio per se, except in the manner @Ümit describes. What chunking does affect is I/O performance. When compression is applied to an HDF5 dataset, it is applied to each chunk individually, as a whole. This means that reading any data from a chunk requires decompressing the entire chunk, which can involve far more I/O than the data actually requested, depending on the size of the chunk cache, the shape of the chunk, and so on.
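To make the cost of whole-chunk decompression concrete, here is a rough back-of-the-envelope sketch in plain Python. It uses the dataset and chunk shapes from the question. The helper names (`chunks_touched`, `bytes_decompressed`) are made up for illustration and are not part of any HDF5 API. It assumes the selection starts at index 0 (a chunk boundary), that every chunk expands to its full size, and it ignores the chunk cache.

```python
import math


def chunks_touched(selection_shape, chunk_shape):
    """Count the chunks overlapped by a selection that starts at index 0."""
    return math.prod(
        math.ceil(sel / chunk) for sel, chunk in zip(selection_shape, chunk_shape)
    )


def bytes_decompressed(selection_shape, chunk_shape, itemsize=8):
    """Uncompressed bytes that must be produced to serve the selection.

    Assumes each touched chunk is decompressed in full (itemsize=8 for float64).
    """
    return chunks_touched(selection_shape, chunk_shape) * math.prod(chunk_shape) * itemsize


# File B: 6720 x 1000 dataset, 6000 x 1 chunks.
b_column = chunks_touched((6720, 1), (6000, 1))   # read one full column
b_row = chunks_touched((1, 1000), (6000, 1))      # read one full row

# File A: 5094125 x 6 dataset, 10000 x 6 chunks.
a_column = chunks_touched((5094125, 1), (10000, 6))
a_row = chunks_touched((1, 6), (10000, 6))

print("File B column:", b_column, "chunks; row:", b_row, "chunks")
print("File A column:", a_column, "chunks; row:", a_row, "chunks")
print("File B row read decompresses", bytes_decompressed((1, 1000), (6000, 1)), "bytes")
```

Under these assumptions, a column read from file B touches only 2 chunks, but a single 1000-value row touches 1000 chunks, so the access pattern matters far more than the compression setting.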

What you should do is make sure that the chunk shape matches how you read/write your data. If you generally read a column at a time, make your chunks columns, for example. This is a good tutorial on chunking.
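As a minimal sketch of the "make your chunks columns" advice, the following uses h5py (an assumption, since the question does not say which library wrote the files) to create a dataset with file B's shape and one chunk per column. The file name is hypothetical, and only one column is written to keep the example fast.

```python
import os
import tempfile

import h5py
import numpy as np

# Shape taken from file B in the question; the file name is hypothetical.
ROWS, COLS = 6720, 1000
path = os.path.join(tempfile.mkdtemp(), "file_b_sketch.h5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset(
        "data",
        shape=(ROWS, COLS),
        maxshape=(None, None),   # unlimited in both dimensions, as in the question
        dtype="f8",              # 64-bit floating point
        chunks=(ROWS, 1),        # one chunk per column, matching column-wise reads
        compression="gzip",
        compression_opts=7,      # GZIP level 7, as in the question
    )
    # Writing a single column allocates and compresses just one chunk.
    dset[:, 0] = np.arange(ROWS, dtype="f8")

with h5py.File(path, "r") as f:
    dset = f["data"]
    chunk_shape = dset.chunks
    compression = dset.compression
    level = dset.compression_opts
    column = dset[:, 0]          # decompresses exactly one chunk

print(chunk_shape, compression, level, column.shape)
```

Note that if the dataset later grows along the row dimension, a `(6720, 1)` chunk no longer covers a whole column, so the chunk shape would need to be chosen with the expected final size in mind.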

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow