Question

I'm working with a large number of datasets, each a pandas DataFrame, which I need to access from disk because of their size. From what I've read, HDF looks like a good way to work with them, but I'm a bit puzzled about the best way to structure the data because of the various bits of metadata that go with each DataFrame. If I were to store the data in memory I could probably use something like a namedtuple (although this wouldn't allow for easy queries):

from collections import namedtuple

DataSet = namedtuple('DataSet', 'model method id data')

data is the attribute holding the actual DataFrame, and the other fields are text. However, I now need to include a series of results fields, which I would probably handle in memory using a dict of DataFrames. If I were to dump this to MongoDB, I would probably have something that looks like this:

[{
    model: 'mir',
    method: 'rfl_max',
    id: 's0001',
    data: <DataFrame>,
    results: [
        {
            option_r: 10,
            window: 30,
            data: <DataFrame>
        },
        ...
    ]
},
....
]
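For comparison, a minimal sketch of how I imagine the in-memory version, assuming the namedtuple above is extended with a results field holding a dict of DataFrames keyed by the parameters that produced them (option_r and window are just the placeholder names from the mongo-style document):

import pandas as pd
from collections import namedtuple

DataSet = namedtuple('DataSet', 'model method id data results')

ds = DataSet(
    model='mir', method='rfl_max', id='s0001',
    data=pd.DataFrame({'value': [0.1, 0.2, 0.3]}),
    # results keyed by (option_r, window)
    results={(10, 30): pd.DataFrame({'value': [1.0, 2.0]})},
)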

My basic question is can I efficiently apply this structure to HDF? Specifically:

  1. Does HDF support this sort of nesting, and if so how do I do it?
  2. Looking up data like this is efficient in mongo because of how it uses indexes. Is the same true for HDF, e.g. could I efficiently find all results matching a specific method and option_r?
  3. My limited experience with HDF is through pandas, which appears to only allow Series, DataFrames and Panels to be stored. Is this a real limitation, or am I just doing something wrong?

Alternatively, does anyone know of a file based mongodb implementation which might serve my purposes?


Solution

HDFStore supports hierarchical keys (path-like node names); see the pandas HDFStore documentation.
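A minimal sketch of what that could look like for your structure, assuming one node per dataset and per result, with the path built from the model/method/id fields (the exact layout is just one possible choice):

import pandas as pd

df = pd.DataFrame({'value': [0.1, 0.2, 0.3]})

store = pd.HDFStore('datasets.h5')

# keys are '/'-separated paths, so nesting maps naturally onto them
store.put('mir/rfl_max/s0001/data', df, format='table')
store.put('mir/rfl_max/s0001/results/r10_w30', df, format='table')

print(store.keys())
# ['/mir/rfl_max/s0001/data', '/mir/rfl_max/s0001/results/r10_w30']
store.close()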

You can store attributes attached to a particular node via the underlying PyTables attrs. This is generally limited to a small amount of metadata.
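A sketch of that pattern, assuming the node key from the previous example already exists in the file:

import pandas as pd

store = pd.HDFStore('datasets.h5')

# attach a small metadata dict to the node holding a results DataFrame
storer = store.get_storer('mir/rfl_max/s0001/results/r10_w30')
storer.attrs.metadata = {'option_r': 10, 'window': 30}

# read it back later
print(store.get_storer('mir/rfl_max/s0001/results/r10_w30').attrs.metadata)
store.close()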

HDF5 is quite efficient at storing/searching the actual data, e.g. a DataFrame. The structure is up to you, but it is not meant to compete with mongodb; rather, it can complement it. Mongo is good at keeping/searching these 'json-like' kinds of nested structures.

You can always keep a reference to the actual location of the data (the DataFrame/Series) in a mongo db, rather than the data itself.
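A hedged sketch of that idea with pymongo (the database/collection names and document layout are assumptions, and a local mongod is assumed to be running): mongo indexes the metadata and stores only a pointer to the HDF5 file and node key, while the DataFrame itself is loaded from the HDFStore on demand.

import pandas as pd
from pymongo import MongoClient

client = MongoClient()
coll = client.mydb.datasets   # placeholder database/collection names

# index the metadata in mongo, keep only a pointer to the HDF5 node
coll.insert_one({
    'model': 'mir', 'method': 'rfl_max', 'id': 's0001',
    'hdf_file': 'datasets.h5',
    'results': [{'option_r': 10, 'window': 30,
                 'hdf_key': 'mir/rfl_max/s0001/results/r10_w30'}],
})

# query the metadata in mongo, then pull the heavy data from HDF5
doc = coll.find_one({'method': 'rfl_max', 'results.option_r': 10})
with pd.HDFStore(doc['hdf_file']) as store:
    frame = store[doc['results'][0]['hdf_key']]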

HDF5 will be orders of magnitude faster for storing/searching the actual data (except for very small datasets).
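For searching within a stored table, a sketch of the usual approach: write with format='table' and data_columns so those columns become queryable with a where clause (the column names here are made up to match the question):

import pandas as pd

df = pd.DataFrame({'option_r': [10, 10, 20],
                   'window':   [30, 60, 30],
                   'value':    [0.1, 0.2, 0.3]})

with pd.HDFStore('results.h5') as store:
    # data_columns makes these columns individually queryable on disk
    store.put('results', df, format='table',
              data_columns=['option_r', 'window'])

    subset = store.select('results', where='option_r == 10 & window == 30')
    print(subset)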
