Question

How do I normalize a multiindex dataframe?

Let's say I have the dataframe:

d = pd.DataFrame([["a",1,3],["a",2,2],["b",4,4],["b",5,8]], 
                  columns=["name","value1","value2"])

how do I calculate the normalized values for each "name"?

I know how to normalize a basic dataframe:

d = (d-d.mean(axis=0))/d.std(axis=0, ddof=1)

but I'm not able to apply this to each "name" group of my dataframe.

So the result I want is:

name, value1, value2
a     -0.5     0.5
a      0.5    -0.5
b     -0.5    -1
b      0.5     1

I tried groupby and a multiindex dataframe, but I'm probably not doing it the right way.


Solution

Normalizing by group is one of the examples in the groupby documentation. But it doesn't do exactly what you seem to want here.

In [2]: d.groupby('name').transform(lambda x: (x-x.mean())/x.std(ddof=1))
Out[2]: 
     value1    value2
0 -0.707107  0.707107
1  0.707107 -0.707107
2 -0.707107 -0.707107
3  0.707107  0.707107

Your desired result suggests that you actually want to normalize the values in each name group using a single mean and standard deviation computed over all the elements of value1 and value2 together. For something like that, you can apply a function to each group individually and reassemble the result.

In [3]: def normalize(group):
   ...:     mean = group.values.ravel().mean()
   ...:     std = group.values.ravel().std(ddof=1)
   ...:     return group.applymap(lambda x: (x - mean)/std)
   ...:

In [4]: pd.concat([normalize(group) for _, group in d.set_index('name').groupby(level=0)])
Out[4]: 
        value1    value2
name                    
a    -1.224745  1.224745
a     0.000000  0.000000
b    -0.660338 -0.660338
b    -0.132068  1.452744
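
For reference, here is a minimal standalone sketch of the same pooled normalization, assuming the example frame from the question. The helper name normalize_pooled is made up for illustration. It relies on element-wise DataFrame arithmetic instead of applymap, which newer pandas releases deprecate in favor of DataFrame.map.

```python
import pandas as pd

d = pd.DataFrame([["a", 1, 3], ["a", 2, 2], ["b", 4, 4], ["b", 5, 8]],
                 columns=["name", "value1", "value2"])


def normalize_pooled(group):
    """Hypothetical helper: z-score a group using one mean/std over all its cells."""
    # Flatten every cell of the group (both value columns) into a 1-D array.
    values = group.to_numpy().ravel()
    # Arithmetic between a DataFrame and scalars is applied element-wise.
    return (group - values.mean()) / values.std(ddof=1)


# Normalize each "name" group separately, then stack the pieces back together.
pooled = pd.concat(
    [normalize_pooled(group) for _, group in d.set_index("name").groupby(level=0)]
)
print(pooled)
```

This should print the same numbers as Out[4] above, since only the way the per-cell arithmetic is expressed has changed.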

OTHER TIPS

Are you sure the result you gave is the correct one? I'm assuming you want to normalize value1 and value2 separately. If that's not correct, let me know.

#  Easier with `name` as the index.

In [65]: d = d.set_index('name')

In [66]: d
Out[66]: 
      value1  value2
name                
a          1       3
a          2       2
b          4       4
b          5       8

In [67]: g = d.groupby(level='name')

In [68]: means = g.mean()

In [69]: stds = g.std()

In [70]: means
Out[70]: 
      value1  value2
name                
a        1.5     2.5
b        4.5     6.0

In [71]: stds
Out[71]: 
        value1    value2
name                    
a     0.707107  0.707107
b     0.707107  2.828427

In [76]: g.transform(lambda x: (x - means) / stds)
Out[76]: 
        value1    value2
name                    
a    -0.707107  0.707107
a     0.707107 -0.707107
a          NaN       NaN
b          NaN       NaN
b    -0.707107 -0.707107
b     0.707107  0.707107

# Get rid of the nans

In [77]: g.transform(lambda x: (x - means) / stds).dropna()
Out[77]: 
        value1    value2
name                    
a    -0.707107  0.707107
a     0.707107 -0.707107
b    -0.707107 -0.707107
b     0.707107  0.707107
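
One possible way to get the same per-column result without the intermediate NaN rows is to let transform broadcast the group statistics back to the original shape. This is only a sketch, assuming the example frame from the question with name as the index; the variable name normalized is chosen here for illustration.

```python
import pandas as pd

d = pd.DataFrame([["a", 1, 3], ["a", 2, 2], ["b", 4, 4], ["b", 5, 8]],
                 columns=["name", "value1", "value2"]).set_index("name")

g = d.groupby(level="name")

# transform("mean") and transform("std") return frames with the same shape
# and index as d, holding each row's group statistic (std uses ddof=1).
normalized = (d - g.transform("mean")) / g.transform("std")
print(normalized)
```

Because the statistics are already aligned row by row with d, no dropna() step is needed afterwards.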
Licensed under: CC-BY-SA with attribution