Question

I have a data frame with user_ids stored as an indexed frame_table in an HDFStore. Also in this HDF file is another table with actions the user took. I want to grab all of the actions taken by 1% of the users. The procedure is as follows:

# Get 1% of the user IDs
df_id = store.select('df_user_id', columns=['id'])
unique_ids = list(df_id.id.unique())
users_1pct = rnd.sample(unique_ids, int(0.01 * len(unique_ids)))  # sample size must be an int
df_id = df_id[df_id.id.isin(users_1pct)]

Now I want to go back and get all of the additional info that describes the actions taken by these users from frame_tables that are indexed identically to df_user_id. As per this example and this question I have done the following:

actions_1pct = store.select('df_actions', where=pd.Term('index', df_id.index))

This simply provides an empty data frame. In fact, if I copy and paste the example in the previous pandas doc link I also get an empty data frame. Did something change about Term in recent pandas? I'm on pandas 0.12.

I'm not tied to any particular solution, as long as I can get the HDFStore indices from a lookup on the df_id table (which is fast) and then pull those indices directly from the other frame tables.


Solution

Here is the way to do it in 0.12. In 0.13, where can be an indexer (e.g. an array of locations), so this is much easier; see [Selecting using a where mask](http://pandas.pydata.org/pandas-docs/dev/io.html#advanced-queries), the second example down.

In [2]: df = DataFrame(dict(A=list(range(5)),B=list(range(5))))

In [3]: df
Out[3]: 
   A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4

In [4]: store = pd.HDFStore('test.h5',mode='w')

In [5]: store.append('df',df)

Select and return a Coordinates object (just a wrapped array of row locations) according to some where condition:

In [6]: c = store.select_as_coordinates('df', ['index<3'])

where accepts Coordinates objects, and you can use them with any table (here that would be your 'df_actions' table):

In [7]: store.select('df', where=c)
Out[7]: 
   A  B
0  0  0
1  1  1
2  2  2

In [8]: c
Out[8]: <pandas.io.pytables.Coordinates at 0x4669590>

In [9]: c.values
Out[9]: array([0, 1, 2])

If you want to manipulate the selection, assign the positions you want to the Coordinates object's values before passing it to select. (As I said above, this 'hack' is going away in 0.13, where you don't need this intermediate object.)

In [10]: c.values = np.array([0,1])

In [11]: store.select('df', where=c)
Out[11]: 
   A  B
0  0  0
1  1  1

In [12]: store.close()
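
For pandas 0.13 and later, where `where` can take an array of row locations directly, the whole workflow from the question might look roughly like the sketch below. It is only an illustration under stated assumptions: the table contents are made-up toy data, `id` is stored as a data column so that `select_column` can read it on its own, both tables are written with the same default integer index, and PyTables is installed. The file name `sketch.h5` and the column `action` are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-ins for the two identically indexed tables (made-up data).
rng = np.random.default_rng(0)
n_rows = 1000
df_user_id = pd.DataFrame({"id": rng.integers(0, 200, size=n_rows)})
df_actions = pd.DataFrame({"action": rng.integers(0, 5, size=n_rows)})

path = os.path.join(tempfile.mkdtemp(), "sketch.h5")  # hypothetical file name

with pd.HDFStore(path, mode="w") as store:
    # Assumption: "id" is a data column, so it can be read without the rest.
    store.append("df_user_id", df_user_id, data_columns=["id"])
    store.append("df_actions", df_actions)

    # Read only the id column; the resulting Series is numbered 0..n-1,
    # which matches the row locations in the table.
    ids = store.select_column("df_user_id", "id")

    # Sample 1% of the unique user IDs (at least one); the size must be an int.
    unique_ids = ids.unique()
    n_sample = max(1, int(0.01 * len(unique_ids)))
    sampled = rng.choice(unique_ids, size=n_sample, replace=False)

    # Row locations of the sampled users, passed straight to `where`.
    rows = ids[ids.isin(sampled)].index
    actions_1pct = store.select("df_actions", where=rows)

print(len(actions_1pct), "action rows for", n_sample, "sampled user(s)")
```

Because only the `id` column is read before sampling, the larger actions table is touched just once, and only at the selected row locations.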
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow