PyTables and HDF5: Massive overhead for tree data

Question

OK, so I have found a way to massively reduce the file size. The point is that, contrary to what I believed, PyTables does NOT apply compression by default.

You can enable compression by passing a Filters object when you create the table.

Here is an example of how that works:

   import tables as pt  # the PyTables package is imported as `tables`

   hdf5_file = pt.open_file(filename='myhdf5file.h5',
                            mode='a',
                            title='How to compress data')
   # For PyTables < 3 the method is called `openFile`;
   # other methods are named analogously (e.g. `createTable`).

   # zlib at the highest compression level
   myfilters = pt.Filters(complevel=9, complib='zlib')

   mydescription = {'mycolumn': pt.IntCol()}  # simple one-column table

   mytable = hdf5_file.create_table(where='/', name='mytable',
                                    description=mydescription,
                                    title='My Table',
                                    filters=myfilters)
   # Now you can happily fill the table...

The important line here is Filters(complevel=9, complib='zlib'). It specifies the compression level (complevel) and the compression library (complib). By default the level is 0, which means compression is disabled, whereas 9 is the highest compression level. For details on how compression works, see the PyTables reference documentation.
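To get a feeling for what the level means, here is a small sketch that calls Python's standard zlib module directly instead of going through PyTables. The sample data and the helper compressed_size are made up for illustration. The sizes you get inside an HDF5 file will differ, because HDF5 compresses the data chunk by chunk and adds its own metadata.

```python
import struct
import zlib

# Hypothetical sample data: 100,000 small integers packed as 32-bit values,
# loosely resembling a single integer column.
values = [i % 100 for i in range(100_000)]
raw = struct.pack(f"<{len(values)}i", *values)

def compressed_size(data, level):
    """Return the number of bytes zlib produces at the given level (0-9)."""
    return len(zlib.compress(data, level))

size_level0 = compressed_size(raw, 0)  # level 0: data is stored uncompressed
size_level9 = compressed_size(raw, 9)  # level 9: strongest compression

print("raw bytes:   ", len(raw))
print("zlib level 0:", size_level0)
print("zlib level 9:", size_level9)
```

At level 0 the output is slightly larger than the input, because zlib only wraps the data in its container format. At level 9 this highly repetitive data shrinks to a small fraction of its original size. Real data usually compresses less well, so it is worth measuring with your own tables.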

Next time, I had better RTFM :-) (I actually did, but I missed the line "One of the beauties of PyTables is that it supports compression on tables and arrays, although it is not used by default")