문제

I have a tree data structure that I want to save to disk. Thus, HDF5 with its internal tree structure seemed to be the perfect candidate. However, so far the data overhead is massive, by a factor of 100!

A test tree contains roughly 100 nodes, where leaves usually contain no more than 2 or 3 data items (like doubles). If I take the entire tree and just pickle it, it is about 21kB large. Yet, if I use PyTables and map the tree structure one to one to the HDF5 file, the file takes 2.4MB (!) disk space. Is the overhead that big?

The problem is that the overhead does not seem to be constant but linearly scales with the size of my tree data (as well as increasing nodes as increasing data per leaf, i.e. enlarging rows of the leaf tables).

Did I miss something regarding PyTables, like enabling compression (I thought PyTables does it by default)? What could possibly be the reason for this massive overhead?

Thanks a lot!

도움이 되었습니까?

해결책

Ok, so I have found a way to massively reduce the file size. The point is, despite my prior believes, PyTables does NOT apply compression per default.

You can achieve this by using Filters.

Here is an example how that works:

   import pytables as pt

   hdf5_file = pt.openFile(filename = 'myhdf5file.h5', 
                           mode='a', 
                           title='How to compress data') 
   # for pytables >= 3 the method is called `open_file`, 
   # other methods are renamed analogously

   myfilters = Filters(complevel=9, complib='zlib')

   mydescitpion = {'mycolumn': pt.IntCol()} # Simple 1 column table

   mytable = hdf5_file.createTable(where='/', name='mytable',
                                     description=mydescription,
                                     title='My Table',
                                     filters=myfilters)
   #Now you can happily fill the table...

The important line here is Filters(complevel=9, complib='zlib'). It specifies the compression level complevel and the compression algorithm complib. Per default the level is set to 0, that means compression is disabled, whereas 9 is the highest compression level. For details on how compression works: HERE IS A LINK TO THE REFERENCE.

Next time, I better stick to RTFM :-) (although I did, but I missed the line "One of the beauties of PyTables is that it supports compression on tables and arrays, although it is not used by default")

라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top