Best approach to storing scientific data sets on disk C++

Question 1

Assuming your data sets really are large enough to merit (e.g., instead of 30,000 elements, a 30,000x30,000 array of doubles), you might want to consider STXXL. It provides interfaces that are intended to (and largely succeed at) imitate those of the collections in the C++ standard library, but are intended to work with data too large to fit in memory.

Question 2

Smack some sense into yourself and use a production-grade library such as HDF5. So you found it too complicated, but did you find its high-level APIs ?

If you don't like that answer, try one of the emerging array databases such as SciDB, rasdaman or MonetDB. I suspect though, that if you have baulked at HDF5 you'll baulk at any of these.

In my view, and experience, it is worth the effort to learn how to properly use a tool such as HDF5 if you are going to be working with large scientific data sets for any length of time. If you pick up a tool such as a NoSQL database, which was not designed for the task at hand, then, while it may initially be easier to use, eventually (before very long would be my guess) it will lack features you need or want and you will find yourself having to program around its deficiencies.

Pick one of the right tools for the job and learn how to use it properly.

Question 3

I have been working on scientific computing for years, and I think HDF5 or NetCDF is a good data format for you to work with. It can provide efficient parallel read/wirte, which is important for dealing with big data.

An alternate solution is to use array database, like SciDB, MonetDB, or RasDaMan. However, it will be kinda painful if you try to load HDF5 data into an array database. I once tried to load HDF5 data into SciDB, but it requires a series of data transformations. You need to know if you will query the data often or not. If not often, then the time-consuming loading may be unworthy.

You may be interested in this paper. It can allow you to query the HDF5 data directly by using SQL.