Question

I have an HDF file like that:

>>> dataset.store
... <class 'pandas.io.pytables.HDFStore'>
... File path: ../data/data_experiments_01-02-03.h5
... /exp01/user01    frame_table  (typ->appendable,nrows->221,ncols->124,indexers->[index])
... /exp01/user02    frame_table  (typ->appendable,nrows->163,ncols->124,indexers->[index])
... /exp01/user03    frame_table  (typ->appendable,nrows->145,ncols->124,indexers->[index])
... /exp02/user01    frame_table  (typ->appendable,nrows->194,ncols->124,indexers->[index])
... /exp02/user02    frame_table  (typ->appendable,nrows->145,ncols->124,indexers->[index])
... /exp03/user03    frame_table  (typ->appendable,nrows->348,ncols->124,indexers->[index])
... /exp03/user01    frame_table  (typ->appendable,nrows->240,ncols->124,indexers->[index])

from which I want to retrieve all the users (userXY) from one of the experiments (exp0Z) and append them into a single big DataFrame. I tried store.get('exp03'), which raises the following error:

>>> store.get('exp03')
... 
... ---------------------------------------------------------------------------
... TypeError                                 Traceback (most recent call last)
... <ipython-input-109-0a2e29e9e0a4> in <module>()
... ----> 1 dataset.store.get('/exp03')
... 
... /Library/Python/2.7/site-packages/pandas/io/pytables.pyc in get(self, key)
...     613         if group is None:
...     614             raise KeyError('No object named %s in the file' % key)
... --> 615         return self._read_group(group)
...     616 
...     617     def select(self, key, where=None, start=None, stop=None, columns=None,
... 
... /Library/Python/2.7/site-packages/pandas/io/pytables.pyc in _read_group(self, group, **kwargs)
...    1277 
...    1278     def _read_group(self, group, **kwargs):
... -> 1279         s = self._create_storer(group)
...    1280         s.infer_axes()
...    1281         return s.read(**kwargs)
... 
... /Library/Python/2.7/site-packages/pandas/io/pytables.pyc in _create_storer(self, group, format, value, append, **kwargs)
...    1160                 else:
...    1161                     raise TypeError(
... -> 1162                         "cannot create a storer if the object is not existing "
...    1163                         "nor a value are passed")
...    1164             else:
... 
... TypeError: cannot create a storer if the object is not existing nor a value are passed

I can retrieve a single user by calling store.get('exp03/user01'), so I guess it is possible to iterate over store.keys() and manually append the retrieved DataFrames, but I wonder if this can be done in a single call to store.get() or some similar method.
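For reference, the manual approach I have in mind looks something like this (a sketch with made-up sample data, mimicking the store layout above):

```python
import numpy as np
import pandas as pd

# Build a small store with the same layout as above (hypothetical data)
store = pd.HDFStore('data_experiments.h5', mode='w')
store.append('exp03/user01', pd.DataFrame(np.random.randn(5, 2)))
store.append('exp03/user03', pd.DataFrame(np.random.randn(5, 2)))

# Iterate the keys, keep only one experiment, and concatenate manually
frames = [store[key] for key in store.keys() if key.startswith('/exp03')]
big_df = pd.concat(frames)
print(big_df.shape)  # (10, 2)
store.close()
```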

EDIT: Note that dataset is a class that contains my pandas.HDFStore.


Solution

This is not implemented, though it could be a nice feature. (FYI, I would not make it the default behavior of .get(...), because it's not explicit enough; e.g. should it ALWAYS read ALL the sub-tables? Too much guessing. It could take an argument to control which sub-tables are read, I suppose.) If you are interested in implementing this, please open an issue on GitHub.

You can use some internal functions to make this pretty easy, though (and you could even pass a where clause to each of the selects).

In [12]: import numpy as np; import pandas as pd; from pandas import DataFrame

In [13]: store = pd.HDFStore('test.h5',mode='w')

In [14]: store.append('df/foo1',DataFrame(np.random.randn(10,2)))

In [15]: store.append('df/foo2',DataFrame(np.random.randn(10,2)))

In [16]: pd.concat([ store.select(node._v_pathname) for node in store.get_node('df') ])
Out[16]: 
          0         1
0 -0.495847 -1.449251
1 -0.494721  1.572560
2  1.219985  0.280878
3 -0.419651  1.975562
4 -0.489689 -2.712342
5 -0.022466 -0.238129
6 -1.195269 -0.028390
7 -0.192648  1.220730
8  1.331892  0.950508
9 -0.790354 -0.743006
0 -0.761820  0.847983
1 -0.126829  1.304889
2  0.667949 -1.481652
3  0.030162 -0.111911
4 -0.433762 -0.596412
5 -1.110968  0.411241
6 -0.428930  0.086527
7 -0.866701 -1.286884
8 -0.649420  0.227999
9 -0.100669 -0.205232

[20 rows x 2 columns]

In [17]: store.close()

Keep in mind, though, that if I were doing this, there is little reason to have SEPARATE nodes when the data has the same structure; it's MUCH more efficient to keep it in a single table with, say, a field that indicates the name or id.
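For instance (a sketch with hypothetical names), storing everything in one table with a user column lets a single select retrieve everything, or just a subset via a where clause:

```python
import numpy as np
import pandas as pd

store = pd.HDFStore('single_table.h5', mode='w')

# One table per experiment, with a 'user' data column instead of separate nodes
for user in ['user01', 'user02']:
    df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
    df['user'] = user
    store.append('exp03', df, data_columns=['user'])

# A single select retrieves everything, or just one user via where
all_users = store.select('exp03')
one_user = store.select('exp03', where="user == 'user01'")
print(all_users.shape, one_user.shape)  # (10, 3) (5, 3)
store.close()
```

Making user a data column is what allows it to be used in the where expression.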

I almost always use different nodes for heterogeneous data (not necessarily different dtypes, but different 'types' of data).

That said, you can organize however you like!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow