Grouped function between 2 columns in a pandas.DataFrame?

https://stackoverflow.com/questions/22139053

19-10-2022

Question

I have a dataframe that has multiple numerical data columns, and a 'group' column. I want to get the output of various functions over two of the columns, for each group.

Example data and function:

import numpy as np
import pandas

df = pandas.DataFrame({"Dummy":[1,2]*6, "X":[1,3,7]*4,
                       "Y":[2,3,4]*4, "group":["A","B"]*6})

def RMSE(X):
    # Compares the first two columns of X, selected by position
    return np.sqrt(np.sum((X.iloc[:,0] - X.iloc[:,1])**2))

I want to do something like

group_correlations = df[["X", "Y"]].groupby('group').apply(RMSE)

But if I do that, the 'group' column isn't in the dataframe. If I do it the other way around, like this:

group_correlations = df.groupby('group')[["X", "Y"]].apply(RMSE)

Then the column selection doesn't work:

df.groupby('group')[['X', 'Y']].head(1)

         Dummy  X  Y group
group                     
A     0      1  1  2     A
B     1      2  3  3     B

The Dummy column is still included, so the function calculates the RMSE on the wrong data.

Is there any way to do what I'm trying to do? I know I could do a for loop over the different groups, and subselect the columns manually, but I'd prefer to do it the pandas way, if there is one.
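For reference, the manual loop mentioned above could look roughly like the following. This is only a sketch of the baseline being avoided; the name `loop_result` is made up for illustration, and the columns are picked by name rather than by position.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Dummy": [1, 2] * 6, "X": [1, 3, 7] * 4,
                   "Y": [2, 3, 4] * 4, "group": ["A", "B"] * 6})

# Iterating over a groupby yields (group label, sub-frame) pairs.
loop_result = {}
for label, sub in df.groupby("group"):
    diff = sub["X"] - sub["Y"]  # select the two columns by name
    loop_result[label] = float(np.sqrt(np.sum(diff ** 2)))

# Collect the per-group values into a Series indexed by group label.
loop_result = pd.Series(loop_result)
print(loop_result)
```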

Solution

This looks like a bug (or perhaps selecting multiple columns on a groupby is not implemented). A workaround is to pass the grouping column in directly:

In [11]: df[['X', 'Y']].groupby(df['group']).apply(RMSE)
Out[11]:
group
A        4.472136
B        4.472136
dtype: float64

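To see that the workaround gives the right answer, compare the plain column selection (wrong) with the same call after dropping the Dummy column (correct):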

In [12]: df.groupby('group')[['X', 'Y']].apply(RMSE)  # wrong
Out[12]:
group
A        8.944272
B        7.348469
dtype: float64

In [13]: df.iloc[:, 1:].groupby('group')[['X', 'Y']].apply(RMSE)  # correct: ignore dummy col
Out[13]:
group
A        4.472136
B        4.472136
dtype: float64
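The transcripts above come from an older pandas release. On a recent pandas installation (an assumption about your environment), selecting columns on the groupby object should pass only those columns to the function, so it is worth checking whether the workaround is still needed. A minimal check, using the hypothetical names `selected` and `workaround`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Dummy": [1, 2] * 6, "X": [1, 3, 7] * 4,
                   "Y": [2, 3, 4] * 4, "group": ["A", "B"] * 6})

def RMSE(X):
    # Same positional definition as in the question
    return np.sqrt(np.sum((X.iloc[:, 0] - X.iloc[:, 1]) ** 2))

# Column selection on the groupby object
selected = df.groupby("group")[["X", "Y"]].apply(RMSE)

# Workaround from the answer: select first, then group by the external column
workaround = df[["X", "Y"]].groupby(df["group"]).apply(RMSE)

# True when the column selection is honoured
print(np.allclose(selected.values, workaround.values))
```

If this prints `False`, the installed version still shows the behaviour described in the question and the workaround remains the safer choice.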

More robust implementation:

To avoid this completely, you could change RMSE to select the columns by name:

In [21]: def RMSE2(X, left_col, right_col):
             return(np.sqrt(np.sum((X[left_col] - X[right_col])**2)))

In [22]: df.groupby('group').apply(RMSE2, 'X', 'Y')  # equivalent to passing lambda x: RMSE2(x, 'X', 'Y')
Out[22]:
group
A        4.472136
B        4.472136
dtype: float64

Thanks to @naught101 for pointing out the sweet apply syntax to avoid the lambda.
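As a possible alternative to `apply`, the same per-group quantity can be computed with vectorised operations: square the row-wise difference first, then aggregate by group. This is a sketch, not part of the original answer, and the names `sq_diff`, `root_sum_sq` and `root_mean_sq` are illustrative. Note that the function called RMSE in the question takes the root of the sum of squares; a root mean square error in the usual sense would use the mean, shown here as a second variant.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Dummy": [1, 2] * 6, "X": [1, 3, 7] * 4,
                   "Y": [2, 3, 4] * 4, "group": ["A", "B"] * 6})

# Row-wise squared difference between the two columns of interest
sq_diff = (df["X"] - df["Y"]) ** 2

# Same quantity as RMSE/RMSE2 above: root of the per-group sum of squares
root_sum_sq = np.sqrt(sq_diff.groupby(df["group"]).sum())

# Conventional RMSE: root of the per-group mean of squares
root_mean_sq = np.sqrt(sq_diff.groupby(df["group"]).mean())

print(root_sum_sq)
print(root_mean_sq)
```

Because nothing is selected by position here, extra columns such as Dummy cannot leak into the calculation.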

Licensed under: CC-BY-SA with attribution