Question

I have a question about how the chunk size of HDF5 files affects compression performance.

I have two HDF5 files with the following properties. Each contains a single dataset, called "data".

File A's "data":

  1. Type: HDF5 Scalar Dataset
  2. No. of Dimensions: 2
  3. Dimension Size: 5094125 x 6
  4. Max. dimension size: Unlimited x Unlimited
  5. Data type: 64-bit floating point
  6. Chunking: 10000 x 6
  7. Compression: GZIP level = 7

File B's "data":

  1. Type: HDF5 Scalar Dataset
  2. No. of Dimensions: 2
  3. Dimension Size: 6720 x 1000
  4. Max. dimension size: Unlimited x Unlimited
  5. Data type: 64-bit floating point
  6. Chunking: 6000 x 1
  7. Compression: GZIP level = 7

File A's size: HDF5: 19 MB, CSV: 165 MB

File B's size: HDF5: 60 MB, CSV: 165 MB

Both files compress well compared to the CSV files. However, file A shrinks to about 10% of the original CSV size, while file B only shrinks to about 30%.

I have tried different chunk sizes to make file B as small as possible, but 30% seems to be the best achievable ratio. Why can file A achieve greater compression while file B cannot?

If file B can achieve the same compression, what should the chunk size be?

Is there any rule for determining the optimum HDF5 chunk size for compression?

Thanks!


Solution

Chunking doesn't really affect the compression ratio per se, except in the manner @Ümit describes. What chunking does do is affect the I/O performance. When compression is applied to an HDF5 dataset, it is applied to whole chunks, individually. This means that when reading data from a single chunk in a dataset, the entire chunk must be decompressed - possibly involving a whole lot more I/O, depending on the size of the cache, shape of the chunk, etc.
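To make the "whole chunks, individually" point concrete, here is a minimal sketch that uses Python's standard `zlib` module as a stand-in for the HDF5 filter pipeline. It is not how the HDF5 library is implemented internally; the toy data, `CHUNK_LEN`, and `read_value` are hypothetical names chosen only for illustration.

```python
import struct
import zlib

# Hypothetical toy data: 12 float64 values, split into chunks of 4 values.
values = [float(i) for i in range(12)]
CHUNK_LEN = 4  # values per chunk (illustrative, not an HDF5 default)

# Compress every chunk on its own, the way a chunked, compressed dataset
# is stored (level 7 mirrors the GZIP level in the question).
chunks = [
    zlib.compress(struct.pack(f"<{CHUNK_LEN}d", *values[i:i + CHUNK_LEN]), 7)
    for i in range(0, len(values), CHUNK_LEN)
]

def read_value(index):
    """Return one value; the whole chunk that holds it gets decompressed."""
    chunk_no, offset = divmod(index, CHUNK_LEN)
    raw = zlib.decompress(chunks[chunk_no])  # full chunk, not just 8 bytes
    return struct.unpack(f"<{CHUNK_LEN}d", raw)[offset]
```

Calling `read_value(5)` only needs 8 bytes of data, yet it has to decompress all 32 bytes of the second chunk. With real chunk sizes the same overhead is much larger.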

What you should do is make sure that the chunk shape matches how you read/write your data. If you generally read a column at a time, make your chunks columns, for example. This is a good tutorial on chunking.
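As a rough way to check whether a chunk shape matches an access pattern, you can count how many chunks a read overlaps. The sketch below does that arithmetic for file B from the question (6720 x 1000 dataset, 6000 x 1 chunks). `chunks_touched` is a hypothetical helper, not part of any HDF5 API, and the byte count ignores the chunk cache.

```python
def chunks_touched(chunk_shape, row_range, col_range):
    """Count chunks overlapping the half-open selection
    rows [r0, r1) x cols [c0, c1)."""
    (r0, r1), (c0, c1) = row_range, col_range
    rows = (r1 - 1) // chunk_shape[0] - r0 // chunk_shape[0] + 1
    cols = (c1 - 1) // chunk_shape[1] - c0 // chunk_shape[1] + 1
    return rows * cols

# File B from the question: 6720 x 1000 float64 values, 6000 x 1 chunks.
shape, chunk_shape = (6720, 1000), (6000, 1)

# Reading one full column overlaps 2 chunks.
one_column = chunks_touched(chunk_shape, (0, shape[0]), (0, 1))

# Reading one full row overlaps 1000 chunks.
one_row = chunks_touched(chunk_shape, (0, 1), (0, shape[1]))

# Uncompressed bytes that must be produced to read that single row:
# 1000 chunks x 6000 values x 8 bytes = 48,000,000 bytes for 8,000 useful ones.
chunk_bytes = chunk_shape[0] * chunk_shape[1] * 8
row_read_bytes = one_row * chunk_bytes
```

So the 6000 x 1 layout suits column-wise reads well, but a row-wise read decompresses roughly 48 MB to return 8 KB. If rows are the common access pattern, a chunk shape that is wider than it is tall would fit better.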

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow