Question

I have a data frame with user_ids stored as an indexed frame_table in an HDFStore. Also in this HDF file is another table with actions the user took. I want to grab all of the actions taken by 1% of the users. The procedure is as follows:

# Get 1% of the user IDs
import random as rnd
df_id = store.select('df_user_id', columns=['id'])
unique_ids = list(df_id.id.unique())
users_1pct = rnd.sample(unique_ids, int(0.01 * len(unique_ids)))
df_id = df_id[df_id.id.isin(users_1pct)]

Now I want to go back and get all of the additional info that describes the actions taken by these users, from frame_tables that are indexed identically to df_user_id. As per this example and this question I have done the following:

actions_1pct = store.select('df_actions', where=pd.Term('index', df_id.index))

This simply provides an empty data frame. In fact, if I copy and paste the example in the previous pandas doc link I also get an empty data frame. Did something change about Term in recent pandas? I'm on pandas 0.12.

I'm not tied to any particular solution, as long as I can get the HDFStore indices from a lookup on the df_id table (which is fast) and then pull those rows directly from the other frame tables.

Solution

Here is the way to do it in 0.12. In 0.13, where can be an indexer (e.g. an array of locations), so this is much easier; see "Selecting using a where mask" (http://pandas.pydata.org/pandas-docs/dev/io.html#advanced-queries), second example down.

In [1]: import numpy as np; import pandas as pd

In [2]: df = pd.DataFrame(dict(A=list(range(5)), B=list(range(5))))

In [3]: df
Out[3]: 
   A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4

In [4]: store = pd.HDFStore('test.h5',mode='w')

In [5]: store.append('df',df)

Select according to some where condition and get back a Coordinates object (just a wrapped array of row locations):

In [6]: c = store.select_as_coordinates('df', ['index<3'])

The where argument accepts Coordinates objects, and you can use them with any table (in your case, that would be your 'df_actions' table):

In [7]: store.select('df', where=c)
Out[7]: 
   A  B
0  0  0
1  1  1
2  2  2

In [8]: c
Out[8]: <pandas.io.pytables.Coordinates at 0x4669590>

In [9]: c.values
Out[9]: array([0, 1, 2])
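
To make the cross-table use concrete, here is a minimal sketch that looks up coordinates in one table and passes them to select on a second, identically ordered table. It assumes a recent pandas with PyTables installed, where select_as_coordinates returns a plain index of row locations rather than the 0.12 Coordinates object. The table names come from the question; the function name and the sample data are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def select_with_shared_coordinates():
    """Look up row coordinates in one table and reuse them on another.

    Assumption: both tables were appended in the same row order, so a row
    location in 'df_user_id' refers to the same row in 'df_actions'.
    """
    users = pd.DataFrame({"id": [10, 11, 12, 13, 14]})
    actions = pd.DataFrame({"action": [100, 101, 102, 103, 104]})

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.h5")
        with pd.HDFStore(path, mode="w") as store:
            store.append("df_user_id", users)
            store.append("df_actions", actions)

            # Row locations of the matching rows in the first table.
            c = store.select_as_coordinates("df_user_id", "index < 3")

            # The same locations select rows from the second table.
            return store.select("df_actions", where=c)
```

Calling select_with_shared_coordinates() should return the first three rows of the actions table, mirroring the select on 'df' shown above.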

If you want to manipulate the selection, just assign the positions you want to the Coordinates object before passing it to select. (As I said above, this 'hack' is going away in 0.13, where you don't need this intermediate object.)

In [10]: c.values = np.array([0,1])

In [11]: store.select('df', where=c)
Out[11]: 
   A  B
0  0  0
1  1  1

In [12]: store.close()
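
For pandas 0.13 and later, where an array of row locations can be passed straight to where, the whole workflow from the question might look roughly like the sketch below. This is only an illustration under stated assumptions: 'id' is stored as a data column so it can be read with select_column, both tables share the same row order, and PyTables is installed. The helper names (build_demo_store, select_actions_for_sample) and the frac and seed parameters are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd


def build_demo_store(path, n_rows=1000, n_users=200, seed=0):
    """Write two identically ordered tables to an HDF5 file (demo data only)."""
    rng = np.random.default_rng(seed)
    df_user_id = pd.DataFrame({"id": rng.integers(0, n_users, size=n_rows)})
    df_actions = pd.DataFrame({"action": rng.integers(0, 5, size=n_rows)})
    with pd.HDFStore(path, mode="w") as store:
        # Assumption: 'id' is a data column, so select_column can read it.
        store.append("df_user_id", df_user_id, data_columns=["id"])
        store.append("df_actions", df_actions)


def select_actions_for_sample(path, frac=0.01, seed=0):
    """Return (sampled ids, rows of 'df_actions' belonging to those ids)."""
    rng = np.random.default_rng(seed)
    with pd.HDFStore(path, mode="r") as store:
        # Read only the id column; the resulting Series is indexed by row location.
        ids = store.select_column("df_user_id", "id")
        unique_ids = ids.unique()

        # Sample a fraction of the distinct ids (at least one).
        n = max(1, int(frac * len(unique_ids)))
        sampled = rng.choice(unique_ids, size=n, replace=False)

        # Row locations of every row that belongs to a sampled id.
        rows = ids[ids.isin(sampled)].index

        # Pass the locations directly as the where argument.
        actions = store.select("df_actions", where=rows)
    return sampled, actions
```

A typical call would be build_demo_store('test.h5') followed by sampled, actions = select_actions_for_sample('test.h5', frac=0.01). Only the id column and the matching action rows are read from disk.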

Licensed under: CC-BY-SA with attribution