Question

I have a DataFrame with millions of rows, indexed by ID and eventdate. For each ID there is a date range with a corresponding weight and roll_mean_weight (a rolling 14-day average of weight).

df.head()
                   weight  roll_mean_weight
ID    eventdate                           
1     2013-08-23       0               NaN
      2013-08-24       0               NaN
      2013-08-25       0               NaN
      2013-08-26       0               NaN
      2013-08-27       0               NaN

I need a subset of this DataFrame that keeps all the eventdate rows but only returns the IDs where roll_mean_weight is > 1.5 for the entire date range. So if, for ID 1, there is one value of roll_mean_weight > 1.5, then all the rows for that ID should be returned.

I have tried a number of things, but they all seem to cut out rows. For example:

a = df.loc[df['roll_mean_weight'] > 1.5]

But this returns only the eventdates and ID rows that match the condition.

a.head()
                    weight  roll_mean_weight
cuid    eventdate                           
1      2013-10-21      19          1.571429
       2013-10-22       0          1.571429
       2013-10-23       0          1.571429
3      2013-10-10       3          1.571429
       2013-10-11       1          1.571429

Any ideas would be great, thanks!


Solution

Create some data. (MultiIndex.from_product is new in pandas 0.13.1, but it is not germane to the problem; it just conveniently creates a MultiIndex.)

In [31]: import numpy as np; import pandas as pd

In [32]: df = pd.DataFrame(np.random.randn(20, 1),
                           columns=['value'],
                           index=pd.MultiIndex.from_product(
                                   [list('abcde'),
                                    list(pd.date_range('20130101', periods=4))
                                   ], names=['l1', 'l2']))

Shift groups c and e up by 10 so that we know (almost certainly) that all of their values are positive

In [33]: df.loc[['e']] += 10

In [34]: df.loc[['c']] += 10

In [35]: df
Out[35]: 
                   value
l1 l2                   
a  2013-01-01   1.644561
   2013-01-02   1.815067
   2013-01-03  -0.015403
   2013-01-04   0.381268
b  2013-01-01  -3.101670
   2013-01-02   2.087237
   2013-01-03   1.878045
   2013-01-04  -0.713234
c  2013-01-01   9.493884
   2013-01-02  10.333547
   2013-01-03  11.104055
   2013-01-04   8.678834
d  2013-01-01   0.862161
   2013-01-02  -1.128578
   2013-01-03  -0.896620
   2013-01-04   1.571880
e  2013-01-01   9.523882
   2013-01-02  11.980969
   2013-01-03   8.759344
   2013-01-04  11.695152

[20 rows x 1 columns]

Group by the first index level and keep only the groups in which all of the values are > 0

In [36]: df.groupby(level=0).filter(lambda x: (x['value']>0).all())
Out[36]: 
                   value
l1 l2                   
c  2013-01-01   9.493884
   2013-01-02  10.333547
   2013-01-03  11.104055
   2013-01-04   8.678834
e  2013-01-01   9.523882
   2013-01-02  11.980969
   2013-01-03   8.759344
   2013-01-04  11.695152

[8 rows x 1 columns]
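
Applied to the frame in the question, the same idea might look like the sketch below. The small frame here is made-up stand-in data, not the asker's real data; only the names ID, eventdate, weight and roll_mean_weight and the 1.5 threshold come from the question. Since the question can be read two ways, both are shown: .any() keeps an ID if it exceeds the threshold at least once, .all() keeps it only if it does so on every date. Because filter calls a Python function once per group, it may be slow with very many IDs, so a transform-based mask is included as a possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's frame: three IDs, four days each.
idx = pd.MultiIndex.from_product(
    [[1, 2, 3], pd.date_range("2013-10-21", periods=4)],
    names=["ID", "eventdate"],
)
df = pd.DataFrame(
    {
        "weight": [19, 0, 0, 0, 0, 0, 0, 0, 3, 1, 4, 2],
        "roll_mean_weight": [
            np.nan, 1.6, 1.6, 1.2,  # ID 1: sometimes above 1.5
            np.nan, 0.0, 0.0, 0.0,  # ID 2: never above 1.5
            1.6, 1.7, 2.0, 2.5,     # ID 3: always above 1.5
        ],
    },
    index=idx,
)

# Reading 1: keep every row of an ID that exceeds 1.5 at least once.
any_above = df.groupby(level="ID").filter(
    lambda g: (g["roll_mean_weight"] > 1.5).any()
)

# Reading 2: keep only IDs that exceed 1.5 on every date.
# NaN > 1.5 is False, so an ID with a NaN warm-up period is dropped here.
all_above = df.groupby(level="ID").filter(
    lambda g: (g["roll_mean_weight"] > 1.5).all()
)

# Alternative for reading 1 without a per-group Python call:
# broadcast "any value above 1.5 in this ID" back onto every row.
above = df["roll_mean_weight"] > 1.5
mask = above.groupby(level="ID").transform("any")
any_above_fast = df[mask]
```

Note that a rolling mean starts with NaN values, as in the df.head() output in the question, so the .all() reading would probably exclude every ID unless those leading NaN rows are dropped or ignored first.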
Licensed under: CC-BY-SA with attribution