Question

When I used NumPy, I stored its data in the native format *.npy. It's very fast and gave me some benefits, like this one:

  • I could read *.npy from C code as simple binary data (I mean, *.npy files are binary-compatible with C structures)

Now I'm dealing with HDF5 (PyTables at the moment). As I read in the tutorial, it uses the NumPy serializer to store NumPy data, so can I read this data from C the same way I read simple *.npy files?

Is HDF5's NumPy data binary-compatible with C structures too?

UPD:

I have a Matlab client reading from HDF5, but I don't want to read HDF5 from C++ through the library, because reading binary data from *.npy is many times faster. So what I really need is a way to read HDF5 from C++ with binary compatibility. Right now I'm using two ways of transferring data - *.npy (read from C++ as raw bytes, and from Python natively) and HDF5 (accessed from Matlab). If possible, I want to use only one way - HDF5 - but to do that I have to find a way to make HDF5 binary-compatible with C++ structures. Please help: if there is some way to turn off compression in HDF5, or anything else that would make HDF5 binary-compatible with C++ structures, tell me where I can read about it...


Solution

I feel your pain. I've been dealing extensively with massive amounts of data stored in HDF5 formatted files, and I've gleaned a few bits of information you may find useful.

If you are in "control" of the file creation (and of writing the data, even if you use an API), you should be able to largely circumvent the HDF5 libraries.

If the output datasets are not chunked, they will be written contiguously. As long as you aren't specifying any byte-order conversion in your datatype definitions (i.e. you are specifying that the data should be written in native float/double/integer format), you should be able to achieve the "binary compatibility" you're asking about.
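To make that concrete, here is a minimal sketch of the writing side using the HDF5 C API (the file name, dataset name and size are made up for the example; the same settings are reachable from PyTables/h5py). Contiguous is in fact the default layout when you don't request chunking or filters, but it doesn't hurt to be explicit:

    #include <hdf5.h>

    int main(void)
    {
        /* made-up dataset size and contents, just for illustration */
        hsize_t dims[1] = {1024};
        double  buf[1024];
        for (int i = 0; i < 1024; i++)
            buf[i] = (double)i;

        hid_t file  = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(1, dims, NULL);

        /* contiguous layout: no chunking, hence no compression filters */
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_layout(dcpl, H5D_CONTIGUOUS);

        /* H5T_NATIVE_DOUBLE: stored in this machine's byte order, no conversion */
        hid_t dset = H5Dcreate2(file, "/data", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }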

To solve my problem I wrote an HDF5 file parser using the file specification http://www.hdfgroup.org/HDF5/doc/H5.format.html

With a fairly simple parser you should be able to identify the offset to (and size of) any dataset. At that point, simply fseek and fread (in C, that is; there may be a higher-level approach you can take in C++).
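For example, once you know the byte offset and element count of a contiguous dataset of native doubles (from such a parser, or e.g. by recording H5Dget_offset() on the side that creates the file), the read itself is plain stdio. This is only a sketch with minimal error handling:

    #include <stdio.h>
    #include <stdlib.h>

    /* Read `count` doubles starting at byte `offset` of an HDF5 file.
       Works only if the dataset is contiguous, uncompressed and stored
       in native byte order. */
    double *read_raw(const char *path, long offset, size_t count)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return NULL;

        double *buf = malloc(count * sizeof *buf);
        if (buf && (fseek(f, offset, SEEK_SET) != 0 ||
                    fread(buf, sizeof *buf, count, f) != count)) {
            free(buf);
            buf = NULL;
        }
        fclose(f);
        return buf;
    }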

If your datasets are chunked, then more parsing is necessary to traverse the b-trees used to organize the chunks.

The only other issue you should be aware of is handling (or eliminating) any system-dependent structure padding.
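A small illustration of what to watch for (the record layout here is hypothetical): the compiler may insert padding between members, so the in-memory struct no longer lines up with the packed on-disk records.

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* hypothetical on-disk record: a 4-byte int followed by an 8-byte double,
       i.e. 12 bytes per record when packed */
    struct rec {
        int32_t id;
        double  value;   /* the compiler will usually place this at offset 8 */
    };

    int main(void)
    {
        /* on most compilers this prints "offset 8, size 16": 4 bytes of
           padding were inserted after `id`, so reading the packed 12-byte
           records straight into an array of struct rec would go out of sync */
        printf("offset %zu, size %zu\n",
               offsetof(struct rec, value), sizeof(struct rec));
        return 0;
    }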

OTHER TIPS

The proper way to read HDF5 files from C is to use the HDF5 API - see this tutorial. In principle it is possible to directly read the raw data from the HDF5 file just as you would with the .npy file, assuming you have not used advanced storage options such as compression in your HDF5 file. However, this essentially defeats the whole point of using the HDF5 format, and I cannot think of any advantage to doing it instead of using the proper HDF5 API. Also note that the API has a simplified high-level version which should make reading from C relatively painless.
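For what it's worth, a rough sketch of that high-level ("lite") interface - the file and dataset names are placeholders, and I'm assuming a 1-D dataset of doubles:

    #include <hdf5.h>
    #include <hdf5_hl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        if (file < 0) return 1;

        /* query the dataset's dimensions, then read it in one call */
        hsize_t     dims[1];
        H5T_class_t cls;
        size_t      tsize;
        H5LTget_dataset_info(file, "/data", dims, &cls, &tsize);

        double *buf = malloc(dims[0] * sizeof *buf);
        H5LTread_dataset_double(file, "/data", buf);

        printf("first value: %g\n", buf[0]);

        free(buf);
        H5Fclose(file);
        return 0;
    }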

HDF5 takes care of binary compatibility of structures for you. You simply have to tell it what your structs consist of (dtype) and you'll have no problems saving/reading record arrays - this is because the type system is basically 1:1 between NumPy and HDF5. If you use h5py, I'm confident the I/O should be fast enough, provided you use all native types and large, batched reads/writes - the entire dataset at once, if allowable. After that it depends on chunking and which filters are used (shuffle and compression, for example) - it's also worth noting that those can sometimes speed things up by greatly reducing file size, so always look at benchmarks. Note that the type and filter choices are made on the end creating the HDF5 document.

If you're trying to parse HDF5 yourself, you're doing it wrong. Use the C++ and C APIs if you're working in C++/C. There are examples of so-called "compound types" on the HDF5 Group's website.
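For example, a sketch of reading a compound dataset into a matching C struct (the struct, field names and dataset path are made up for illustration). H5Tinsert with HOFFSET tells the library the compiler's actual member offsets, so type conversion and padding are handled for you - this is the C-side counterpart of the NumPy dtype mapping mentioned above:

    #include <hdf5.h>

    struct particle {        /* hypothetical record type */
        double x;
        double y;
        int    id;
    };

    int main(void)
    {
        struct particle buf[100];   /* assume the dataset has 100 records */

        /* describe the in-memory struct layout to HDF5 */
        hid_t mtype = H5Tcreate(H5T_COMPOUND, sizeof(struct particle));
        H5Tinsert(mtype, "x",  HOFFSET(struct particle, x),  H5T_NATIVE_DOUBLE);
        H5Tinsert(mtype, "y",  HOFFSET(struct particle, y),  H5T_NATIVE_DOUBLE);
        H5Tinsert(mtype, "id", HOFFSET(struct particle, id), H5T_NATIVE_INT);

        hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset = H5Dopen2(file, "/particles", H5P_DEFAULT);

        /* HDF5 converts from the file layout into our struct layout */
        H5Dread(dset, mtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        H5Dclose(dset);
        H5Fclose(file);
        H5Tclose(mtype);
        return 0;
    }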

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow