Create a multi-index df
In [35]: df = DataFrame(randn(100000,3),columns=list('ABC'))
In [36]: df['one'] = 'foo'
In [37]: df['two'] = 'bar'
In [38]: df.ix[50000:,'two'] = 'bah'
In [40]: mi = df.set_index(['one','two'])
In [41]: mi
Out[41]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 3 columns):
A 100000 non-null values
B 100000 non-null values
C 100000 non-null values
dtypes: float64(3)
Store it as a table
In [42]: store = pd.HDFStore('test.h5',mode='w')
In [43]: store.append('df',mi)
get_storer
will return the stored object (but not retrieve the data)
In [44]: store.get_storer('df').levels
Out[44]: ['one', 'two']
In [2]: store
Out[2]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable_multi,nrows->100000,ncols->5,indexers->[index],dc->[two,one])
The index levels are created as data_columns, meaning you can use them in selections This is how to select only the index
In [48]: store.select('df',columns=['one'])
Out[48]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Empty DataFrame
To select a single column and return it as a mi-frame
In [49]: store.select('df',columns=['A'])
Out[49]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 1 columns):
A 100000 non-null values
dtypes: float64(1)
To select a single column as a Series (which can also be the index as these are stored as columns). This will be quite fast.
In [2]: store.select_column('df','one')
Out[2]:
0 foo
1 foo
2 foo
3 foo
4 foo
5 foo
6 foo
7 foo
8 foo
9 foo
10 foo
11 foo
12 foo
13 foo
14 foo
...
99985 foo
99986 foo
99987 foo
99988 foo
99989 foo
99990 foo
99991 foo
99992 foo
99993 foo
99994 foo
99995 foo
99996 foo
99997 foo
99998 foo
99999 foo
Length: 100000, dtype: object
If you really want the fastest selection of only the index
In [4]: %timeit store.select_column('df','one')
100 loops, best of 3: 8.71 ms per loop
In [5]: %timeit store.select('df',columns=['one'])
10 loops, best of 3: 43 ms per loop
Or to get a complete index
In [6]: def f():
...: level_1 = store.select_column('df','one')
...: level_2 = store.select_column('df','two')
...: return MultiIndex.from_arrays([ level_1, level_2 ])
...:
In [17]: %timeit f()
10 loops, best of 3: 28.1 ms per loop
If you want the values for each levels, a pretty fast way of doing it
In [2]: store.select_column('df','one').unique()
Out[2]: array(['foo'], dtype=object)
In [3]: store.select_column('df','two').unique()
Out[3]: array(['bar', 'bah'], dtype=object)