Question

I'm currently exploring HDF5. I've read the interesting comments in the thread "Evaluating HDF5", and I understand that HDF5 is a good choice for storing data, but how do you query it? For example, say I have a big file containing some identifiers: is there a way to quickly find out whether a given identifier is present in the file?

Solution

I think the answer is "not directly".

Here are some of the ways I think you could achieve the functionality.

Use groups:

A hierarchy of groups could be used in the form of a Radix Tree to store the data. This probably doesn't scale too well though.
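As a rough illustration of the group idea, an identifier can be cut into fixed-size chunks, with each chunk becoming one level of the group hierarchy. The helper name `IdentifierToGroupPath` and the chunk size are assumptions made for this sketch, not part of HDF5, and it assumes identifiers never contain a '/'.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: turn an identifier into a nested group path by cutting
// it into fixed-size chunks, e.g. "ABCDEF" with chunk 2 gives "/AB/CD/EF".
std::string IdentifierToGroupPath(const std::string& rIdentifier, std::size_t chunk = 2)
{
    assert(chunk > 0);
    std::string path;
    for (std::size_t i = 0; i < rIdentifier.size(); i += chunk)
    {
        path += '/';
        path += rIdentifier.substr(i, chunk); // last chunk may be shorter
    }
    return path;
}
```

Testing for an identifier then becomes a matter of checking whether the resulting path exists in the file. Smaller chunks give deeper but narrower trees.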

Use index datasets:

HDF5 has a reference type that can be used to link to a main table from separate index tables. After writing the main data, additional datasets can be written, each sorted on a different key and holding references into the main dataset. For example:

MainDataset (sorted on identifier)
0: { A, "C", 2 }
1: { B, "B", 1 }
2: { C, "A", 3 }

StringIndex
0: { "A", Reference ("MainDataset", 2) }
1: { "B", Reference ("MainDataset", 1) }
2: { "C", Reference ("MainDataset", 0) }

IntIndex
0: { 1, Reference ("MainDataset", 1) }
1: { 2, Reference ("MainDataset", 0) }
2: { 3, Reference ("MainDataset", 2) }

To use the index tables above, a binary search has to be written to look up the key field in them.
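A minimal sketch of that lookup, assuming the IntIndex table has already been read into memory (for example with H5Dread) as a vector sorted by key. The struct `IndexEntry` and the function `FindInIndex` are hypothetical names, and a plain row number stands in for the HDF5 reference.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory image of one IntIndex row: the key plus the row
// number in MainDataset (standing in for the HDF5 reference).
struct IndexEntry
{
    int key;
    std::size_t mainRow;
};

// Returns true and sets rMainRow if key is found. rIndex must be sorted by key.
bool FindInIndex(const std::vector<IndexEntry>& rIndex, int key, std::size_t& rMainRow)
{
    assert(std::is_sorted(rIndex.begin(), rIndex.end(),
                          [](const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; }));
    // lower_bound finds the first entry whose key is not less than the target
    auto it = std::lower_bound(rIndex.begin(), rIndex.end(), key,
                               [](const IndexEntry& e, int k) { return e.key < k; });
    if (it == rIndex.end() || it->key != key)
    {
        return false;
    }
    rMainRow = it->mainRow;
    return true;
}
```

With the IntIndex example above, looking up key 2 would yield row 0 of MainDataset. For an index too large to load, the same search could read one entry per probe instead, at the cost of more I/O.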

In-memory index:

Depending on the size of the dataset, it may be just as easy to use an in-memory index that is read from and written to its own dataset using something like Boost.Serialization.
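A minimal sketch of the in-memory part, assuming the identifier column has been read into a vector of strings. The names `IdentifierIndex`, `BuildIndex`, `ContainsIdentifier` and `RowOf` are made up for illustration, and saving the index back to its own dataset is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Maps an identifier to its row number in the main dataset.
typedef std::unordered_map<std::string, std::size_t> IdentifierIndex;

// Build a hash index from the identifier column (row i holds rIdentifiers[i]).
IdentifierIndex BuildIndex(const std::vector<std::string>& rIdentifiers)
{
    IdentifierIndex index;
    index.reserve(rIdentifiers.size());
    for (std::size_t row = 0; row < rIdentifiers.size(); ++row)
    {
        // emplace keeps the first row if an identifier is repeated
        index.emplace(rIdentifiers[row], row);
    }
    return index;
}

bool ContainsIdentifier(const IdentifierIndex& rIndex, const std::string& rIdentifier)
{
    return rIndex.find(rIdentifier) != rIndex.end();
}

std::size_t RowOf(const IdentifierIndex& rIndex, const std::string& rIdentifier)
{
    IdentifierIndex::const_iterator it = rIndex.find(rIdentifier);
    assert(it != rIndex.end()); // caller is expected to check ContainsIdentifier first
    return it->second;
}
```

Lookups are constant time on average, but the whole index has to fit in memory and must be rebuilt or reloaded each time the file is opened.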

HDF5-FastQuery:

This paper (and also this page) describe the use of bitmap indices to perform complex queries over an HDF5 dataset. I have not tried this.

OTHER TIPS

H5Lexists was introduced for this in HDF5 1.8.0:

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Exists

You can also iterate over the things that are in an HDF5 file with H5Literate:

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Iterate

With earlier versions you can check manually by trying to open the dataset. We use code like this to work with any version of HDF5:

bool DoesDatasetExist(const std::string& rDatasetName)
{
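#if H5_VERS_MAJOR > 1 || (H5_VERS_MAJOR == 1 && H5_VERS_MINOR >= 8)
    // This is a nice method for testing existence, introduced in HDF5 1.8.0.
    // Note that it only tests that a link with this name exists, not that it points to a dataset.
    htri_t dataset_status = H5Lexists(mFileId, rDatasetName.c_str(), H5P_DEFAULT);
    return (dataset_status>0);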
#else
    bool result=false;
    // This is not a nice way of doing it because the error stack produces a load of 'HDF failed' output.
    // The "TRY" macros are a convenient way to temporarily turn the error stack off.
    H5E_BEGIN_TRY
    {
        hid_t dataset_id = H5Dopen(mFileId, rDatasetName.c_str());
        if (dataset_id>0)
        {
            H5Dclose(dataset_id);
            result = true;
        }
    }
    H5E_END_TRY;
    return result;
#endif
}
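One caveat with the H5Lexists branch: according to the HDF5 reference manual, it only checks the final element of a path and fails if an intermediate group is missing. For nested names such as "/a/b/c", a common workaround is to check each prefix in turn. The sketch below is one way to do that. `DoesPathExist` is a hypothetical helper, and the existence test is passed in as a callback so the logic stays independent of the HDF5 headers. In real use the callback would presumably wrap `H5Lexists(mFileId, path.c_str(), H5P_DEFAULT) > 0`. It assumes an absolute, normalised path with no doubled or trailing slashes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Checks every prefix of an absolute path ("/a", "/a/b", "/a/b/c") with the
// supplied predicate and stops at the first missing link.
bool DoesPathExist(const std::string& rPath,
                   const std::function<bool(const std::string&)>& linkExists)
{
    assert(!rPath.empty() && rPath[0] == '/');
    std::size_t pos = 0;
    while (pos != std::string::npos)
    {
        pos = rPath.find('/', pos + 1);                  // end of the next path element
        const std::string prefix = rPath.substr(0, pos); // whole path when pos is npos
        // The root "/" itself is skipped; it always exists in an open file.
        if (prefix.size() > 1 && !linkExists(prefix))
        {
            return false;
        }
    }
    return true;
}
```

This costs one existence test per path element, which is usually negligible next to the cost of opening the file.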

This paper may be helpful to you: http://www.cse.ohio-state.edu/~wayi/papers/HDF5_SQL.pdf

Is this what you need? It lets you query HDF5 data with SQL, which is a declarative language.

Unlike FastQuery, there is no index in this work, but our group also provides an open source version with bitmap index.

Moreover, if you want to complete the query (especially for aggregation) in real time, you should consider approximate aggregation or online aggregation. I have also developed some products which directly work on HDF5.

Furthermore, some queries over HDF5 can be much more complex than what you may have seen in relational databases. Some queries are array-oriented rather than relational table-oriented. Just google "SciQL" and you will find some complex and unique query types for the array-based data model, which can certainly be applied to HDF5. Do you need to perform those kinds of queries? I have also developed a product to support some of the complicated query types there.

What do you mean by identifier? If you mean an attribute, check this tutorial. In C:

status = H5Aread(attr_id, mem_type_id, buf);
status = H5Awrite(attr_id, mem_type_id, buf);
Licensed under: CC-BY-SA with attribution