Question

What are the main benefits of storing data in HDF? And what are the main data science tasks where HDF is really suitable and useful?


Solution

Perhaps a good way to paraphrase the question is, what are the advantages compared to alternative formats?

The main alternatives are, I think: a database, text files, or another packed/binary format.

The database options to consider are probably a columnar store or NoSQL, or for small self-contained datasets SQLite. The main advantages of a database are the ability to work with data much larger than memory, to have random or indexed access, and to add/append/modify data quickly. The main *dis*advantage is that it is much slower than HDF for problems in which the entire dataset needs to be read in and processed. Another disadvantage is that, with the exception of embedded-style databases like SQLite, a database is a system (requiring administration, setup, maintenance, etc.) rather than a simple self-contained data store.
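For illustration, here is a minimal sketch of the indexed, random access a database gives you, using Python's built-in sqlite3 module; the file and table names are made up:

```python
import sqlite3

# Toy embedded database; "events.db" and the schema are made up for illustration.
conn = sqlite3.connect("events.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [(f"row-{i}",) for i in range(1000)])
conn.commit()

# Indexed random access: fetch one row without scanning everything,
# something plain HDF files are not designed for.
row = conn.execute("SELECT payload FROM events WHERE id = ?", (42,)).fetchone()
print(row)
conn.close()
```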

The text file format options are XML/JSON/CSV. They are cross-platform, cross-language, and toolkit-agnostic, and they make a good archival format because they are self-describing (or at least obvious :). If uncompressed, they are huge (10x-100x the size of HDF), but if compressed, they can be fairly space-efficient (compressed XML is about the same size as HDF). The main disadvantage here is again speed: parsing text is much, much slower than reading HDF.
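As a rough way to check the size claim on your own data, here is a hedged sketch that writes the same frame as plain CSV and as compressed HDF5 and compares the file sizes; it assumes pandas with PyTables installed, and the file names are illustrative:

```python
import os

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 10),
                  columns=[f"c{i}" for i in range(10)])

df.to_csv("sample.csv", index=False)                                   # plain text
df.to_hdf("sample.h5", key="df", mode="w", complib="zlib", complevel=9)  # compressed HDF5

for path in ("sample.csv", "sample.h5"):
    print(path, os.path.getsize(path), "bytes")
```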

The other binary formats (npy/npz numpy files, blz blaze files, protocol buffers, Avro, ...) have very similar properties to HDF, except that they are less widely supported (they may be limited to a single ecosystem, e.g. numpy) and may have other specific limitations. They typically do not offer a compelling advantage.

HDF is a good complement to databases: it may make sense to run a query to produce a roughly memory-sized dataset and then cache it in HDF if the same data will be used more than once (see the sketch below). If you have a dataset which is fixed and usually processed as a whole, storing it as a collection of appropriately sized HDF files is not a bad option. If you have a dataset which is updated often, periodically staging some of it as HDF files may still be helpful.
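Here is a minimal sketch of that query-then-cache pattern; the DSN, table, and file names are placeholders, not a prescribed setup:

```python
import os

import pandas as pd
from sqlalchemy import create_engine

CACHE = "query_cache.h5"  # hypothetical cache file

def load_dataset():
    # Reuse the HDF cache if an earlier run already materialized the query.
    if os.path.exists(CACHE):
        return pd.read_hdf(CACHE, key="result")
    engine = create_engine("sqlite:///source.db")  # placeholder; any SQLAlchemy DSN works
    df = pd.read_sql("SELECT * FROM big_table WHERE year = 2014", engine)  # made-up query
    df.to_hdf(CACHE, key="result", mode="w")
    return df

df = load_dataset()
```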

To summarize, HDF is a good format for data which is typically read (or written) as a whole; it is a lingua franca, a common/preferred interchange format for many applications thanks to wide support and compatibility; it is decent as an archival format; and it is very fast.

P.S. To give this some practical context: in my most recent experience comparing HDF to alternatives, a certain small (much less than memory-sized) dataset took 2 seconds to read as HDF (most of which is probably overhead from Pandas), ~1 minute to read from JSON, and 1 hour to write to a database. Certainly the database write could be sped up, but you'd better have a good DBA! This is how it works out of the box.
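If you want to reproduce that kind of comparison on your own data, a simple harness like the following works; the dataset here is synthetic and the numbers will of course differ from mine:

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000_000, 5), columns=list("abcde"))
df.to_hdf("bench.h5", key="df", mode="w")  # requires PyTables
df.to_json("bench.json")

for label, loader in [("HDF", lambda: pd.read_hdf("bench.h5", key="df")),
                      ("JSON", lambda: pd.read_json("bench.json"))]:
    t0 = time.perf_counter()
    loader()
    print(f"{label}: {time.perf_counter() - t0:.2f} s")
```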

OTHER TIPS

One benefit is wide support - C, Java, Perl, Python, and R all have HDF5 bindings.
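For example, a minimal write/read round trip with the Python h5py binding looks like this (file and dataset names are illustrative); the resulting file can then be opened from C, Java, R, etc. via their respective HDF5 libraries:

```python
import h5py
import numpy as np

with h5py.File("shared.h5", "w") as f:
    dset = f.create_dataset("measurements", data=np.arange(100.0))
    dset.attrs["units"] = "seconds"  # self-describing metadata travels with the data

with h5py.File("shared.h5", "r") as f:
    data = f["measurements"][:]
    print(data.shape, f["measurements"].attrs["units"])
```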

Another benefit is speed. I haven't benchmarked it myself, but HDF is supposed to be faster than SQL databases for reading datasets in bulk.

I understand that it works very well both for large scientific datasets and for time series data - network monitoring, usage tracking, etc.
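As a sketch of the time-series case, pandas' HDFStore in table format supports appending new batches and querying a date range on disk; the column and file names here are illustrative:

```python
import numpy as np
import pandas as pd

# A week of synthetic per-minute monitoring data.
idx = pd.date_range("2014-01-01", periods=10_000, freq="min")
df = pd.DataFrame({"bytes_in": np.random.randint(0, 10_000, len(idx))}, index=idx)

with pd.HDFStore("netmon.h5") as store:
    # format="table" makes the node appendable and queryable on disk.
    store.append("traffic", df, format="table")
    # Pull back only one day's slice instead of loading everything.
    day2 = store.select("traffic", where="index >= '2014-01-02' & index < '2014-01-03'")

print(len(day2))
```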

I don't believe there is a size limitation for HDF files (although OS limits would still apply).

To add to the other answers, check out ASDF, in particular the paper "ASDF: A new data format for Astronomy"; ASDF tries to improve upon HDF5, and the paper describes some downsides of the HDF5 format.
