Question

I have a dataframe that has multiple numerical data columns, and a 'group' column. I want to get the output of various functions over two of the columns, for each group.

Example data and function:

import numpy as np
import pandas

df = pandas.DataFrame({"Dummy":[1,2]*6, "X":[1,3,7]*4,
                       "Y":[2,3,4]*4, "group":["A","B"]*6})

def RMSE(X):
  return(np.sqrt(np.sum((X.iloc[:,0] - X.iloc[:,1])**2)))

I want to do something like:

group_correlations = df[["X", "Y"]].groupby('group').apply(RMSE)

But if I do that, the 'group' column is no longer in the dataframe, so there is nothing to group by. If I do it the other way around, like this:

group_correlations = df.groupby('group')[["X", "Y"]].apply(RMSE)

Then the column selection doesn't work:

df.groupby('group')[['X', 'Y']].head(1)

         Dummy  X  Y group
group                     
A     0      1  1  2     A
B     1      2  3  3     B

The Dummy column is still included, so the function will calculate RMSE on the wrong data.

Is there any way to do what I'm trying to do? I know I could do a for loop over the different groups, and subselect the columns manually, but I'd prefer to do it the pandas way, if there is one.


Solution

This looks like a bug (or perhaps selecting multiple columns on a groupby is not implemented?). A workaround is to pass the groupby column in directly:

In [11]: df[['X', 'Y']].groupby(df['group']).apply(RMSE)
Out[11]:
group
A        4.472136
B        4.472136
dtype: float64
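For anyone who wants to run the workaround outside an IPython session, here is a minimal self-contained sketch. It reuses the question's data and RMSE function unchanged (note that the function as written takes the root of the sum, not the mean, of the squared differences); the name group_rmse is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Same example data as in the question.
df = pd.DataFrame({"Dummy": [1, 2] * 6, "X": [1, 3, 7] * 4,
                   "Y": [2, 3, 4] * 4, "group": ["A", "B"] * 6})

def RMSE(X):
    # Position-based: uses whatever the first two columns of X happen to be.
    return np.sqrt(np.sum((X.iloc[:, 0] - X.iloc[:, 1]) ** 2))

# Select the data columns first, then group by the 'group' Series itself,
# so each chunk handed to RMSE contains only X and Y.
group_rmse = df[["X", "Y"]].groupby(df["group"]).apply(RMSE)
print(group_rmse)
```

Because the grouping key is passed as a Series rather than a column name, it is aligned with the selected columns by index and does not need to be part of the frame being grouped.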

To check that this gives the right result, compare it with the other approaches:

In [12]: df.groupby('group')[['X', 'Y']].apply(RMSE)  # wrong
Out[12]:
group
A        8.944272
B        7.348469
dtype: float64

In [13]: df.iloc[:, 1:].groupby('group')[['X', 'Y']].apply(RMSE)  # correct: ignore dummy col
Out[13]:
group
A        4.472136
B        4.472136
dtype: float64
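If the function only needs a row-wise quantity, as this one does, another option is to skip apply altogether. The sketch below is not part of the original answer: it computes the squared differences first and then groups the resulting Series by the 'group' column, so column selection on the groupby never comes into play. The names squared_diff and group_rmse_vec are illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as in the question.
df = pd.DataFrame({"Dummy": [1, 2] * 6, "X": [1, 3, 7] * 4,
                   "Y": [2, 3, 4] * 4, "group": ["A", "B"] * 6})

# Row-wise squared difference between the two columns of interest.
squared_diff = (df["X"] - df["Y"]) ** 2

# Sum within each group, then take the square root, mirroring what the
# question's RMSE function does inside each group.
group_rmse_vec = np.sqrt(squared_diff.groupby(df["group"]).sum())
print(group_rmse_vec)
```

This only fits functions that can be split into a row-wise step and a simple aggregation; for anything more involved, apply is still the more general tool.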

More robust implementation:

To avoid this completely, you could change RMSE to select the columns by name:

In [21]: def RMSE2(X, left_col, right_col):
             return(np.sqrt(np.sum((X[left_col] - X[right_col])**2)))

In [22]: df.groupby('group').apply(RMSE2, 'X', 'Y')  # equivalent to passing lambda x: RMSE2(x, 'X', 'Y')
Out[22]:
group
A        4.472136
B        4.472136
dtype: float64
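The name-based version can be sketched as a standalone script as follows. One deliberate difference from the session above, and an assumption on my part: the sketch selects [["X", "Y"]] on the groupby so that the grouping column is not passed to the function, which should behave sensibly on recent pandas versions. Since RMSE2 looks columns up by name, the result does not depend on whether that selection is honoured or on column order. The variable names (grouped, by_args, by_lambda, shuffled, by_shuffled) are illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as in the question.
df = pd.DataFrame({"Dummy": [1, 2] * 6, "X": [1, 3, 7] * 4,
                   "Y": [2, 3, 4] * 4, "group": ["A", "B"] * 6})

def RMSE2(X, left_col, right_col):
    # Name-based: looks the two columns up explicitly, so extra columns
    # and column order do not affect the result.
    return np.sqrt(np.sum((X[left_col] - X[right_col]) ** 2))

grouped = df.groupby("group")[["X", "Y"]]

# Extra positional arguments to apply are forwarded to the function...
by_args = grouped.apply(RMSE2, "X", "Y")
# ...which is equivalent to wrapping the call in a lambda.
by_lambda = grouped.apply(lambda x: RMSE2(x, "X", "Y"))

# Reordering the columns does not change the answer.
shuffled = df[["Y", "Dummy", "group", "X"]]
by_shuffled = shuffled.groupby("group")[["X", "Y"]].apply(RMSE2, "X", "Y")

print(by_args)
```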

Thanks to @naught101 for pointing out the sweet apply syntax to avoid the lambda.

Licensed under: CC-BY-SA with attribution