Question

I am using h5py to store data in Python:

import h5py
def store(eigenvalues,eigenvectors,name='01_'):

        datafile = h5py.File(name+'data.h5', 'w')

        datafile['eigenvalues'] = eigenvalues
        datafile['eigenvectors'] = (eigenvectors)
        datafile.close()
        print("Successfully saved eigenvalues and eigenvectors")

It is really useful for storing these large arrays. But when trying to store, say, only two columns of data, I found that saving to a plain data file is more space-efficient. Is there a critical data size above which the h5py format becomes more efficient? Also, are there any other non-obvious advantages of using this format?


Solution

There are lots of advantages to using HDF5. As @EnricoGiampieri says, it's generally used for storing large ensembles of data, rather than just single arrays. It is also useful for storing all the associated metadata in the same file. From the HDF5 website:

The HDF5 technology suite includes:

  • A versatile data model that can represent very complex data objects and a wide variety of metadata.
  • A completely portable file format with no limit on the number or size of data objects in the collection.
  • A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
  • A rich set of integrated performance features that allow for access time and storage space optimizations.
  • Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.

It's a hierarchical data format that is self-describing, which means that the datasets in the file are easily discoverable. It scales to very large file sizes and massively parallel I/O.
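To illustrate what "discoverable" means in practice, here is a minimal sketch that walks a file and lists every dataset it contains. The helper name list_datasets, the file name demo.h5 and the run1/... layout are made up for this example; visititems is the h5py call that does the walking.

```python
import os
import tempfile

import h5py
import numpy as np


def list_datasets(path):
    """Return {dataset path: (shape, dtype)} for every dataset in an HDF5 file."""
    found = {}

    def visitor(name, obj):
        # visititems calls this for every group and dataset below the root
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(visitor)
    return found


# Hypothetical example file with a nested layout
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("run1/eigenvalues", data=np.arange(5, dtype="f8"))
    f.create_dataset("run1/eigenvectors", data=np.eye(5))

contents = list_datasets(demo_path)
for name, (shape, dtype) in sorted(contents.items()):
    print(name, shape, dtype)
```

No prior knowledge of the layout is needed to read the file back: names, shapes and dtypes all come from the file itself.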

As regards compression, this is a property of an individual dataset and needs to be specified when you create that dataset. There are several different options for what compression algorithm to use - GZIP, SZIP and LZF are all supported. There is more information on the h5py wiki.
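As a rough way to see the effect of the compression setting, the sketch below writes the same array three times (uncompressed, GZIP, LZF) and compares the resulting file sizes. The helper file_size_with and the test array are assumptions for illustration; how much space you actually save depends entirely on how compressible your own data is.

```python
import os
import tempfile

import h5py
import numpy as np


def file_size_with(compression, data, **opts):
    """Write `data` to a fresh HDF5 file and return the file size in bytes."""
    path = os.path.join(tempfile.mkdtemp(), "size_test.h5")
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=data, compression=compression, **opts)
    return os.path.getsize(path)


# Illustrative, highly repetitive array (about 2 MB uncompressed)
data = np.tile(np.arange(500, dtype="f8"), (500, 1))

sizes = {
    "none": file_size_with(None, data),
    "gzip": file_size_with("gzip", data, compression_opts=4),
    "lzf": file_size_with("lzf", data),
}

for name, size in sizes.items():
    print(f"{name:>4}: {size} bytes")
```

Running the same comparison on a very small array is a quick way to check whether the file format's own bookkeeping outweighs the data for your case.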

To apply compression to your file, try this:

import h5py
def store(eigenvalues,eigenvectors,name='01_'):

    datafile = h5py.File(name+'data.h5', 'w')

    # Compression is set per dataset, at creation time
    eigenvalues_dset = datafile.create_dataset('eigenvalues', eigenvalues.shape, eigenvalues.dtype, compression='gzip', compression_opts=4)
    # Use the shape of eigenvectors here, not of eigenvalues
    eigenvectors_dset = datafile.create_dataset('eigenvectors', eigenvectors.shape, eigenvectors.dtype, compression='gzip', compression_opts=4)

    # [:] writes into the existing (compressed) datasets
    datafile['eigenvalues'][:] = eigenvalues
    datafile['eigenvectors'][:] = eigenvectors
    datafile.close()
    print("Successfully saved eigenvalues and eigenvectors")

Here I've assumed that eigenvalues and eigenvectors are both numpy arrays. You should convert them if they are not (just use numpy.array(eigenvalues)). Also note that to assign the datasets, I've used [:] - this is because datafile['eigenvalues'] is an HDF5 object, while datafile['eigenvalues'][:] is the actual data in that object. The HDF5 object holds not just the data, but also attributes and metadata.
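The difference between the dataset object and the data it holds can be seen in a short sketch like the following. The file name attrs_demo.h5 and the "units" attribute (with its value) are invented for the example; attrs, compression and [:] are regular h5py features.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "attrs_demo.h5")
values = np.linspace(0.0, 1.0, 4)

with h5py.File(path, "w") as f:
    dset = f.create_dataset("eigenvalues", data=values, compression="gzip")
    # Metadata travels with the dataset; "units" is a hypothetical attribute
    dset.attrs["units"] = "eV"

with h5py.File(path, "r") as f:
    dset = f["eigenvalues"]                       # HDF5 dataset object
    is_dataset = isinstance(dset, h5py.Dataset)
    data = dset[:]                                # actual values, read into memory
    is_array = isinstance(data, np.ndarray)
    units = dset.attrs["units"]
    compression = dset.compression

print(is_dataset, is_array, units, compression)
```

Note that data stays usable after the file is closed because [:] copies the values into a NumPy array, whereas the dataset object itself is only valid while the file is open.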

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow