Domanda

I have a multiindex dataframe with sales data for different regions, sizes, and dates. I want to calculate the "worldwide" (over all regions) sum of sales, by size, for each date, then assign that to a column in the original dataframe, with each worldwide value for sales and size broadcast to every region. I thought I could group by size and date, then cumsum() and then take the answer and reindex it to the original dataframe, but that doesn't seem to work.

Here's the setup code:

import pandas as pd
#Create the dataframe 'df'
regions=['NorthAm']*9
regions.extend(['APAC']*9)
regions.extend(['Eur']*9)
sizes=[12]*3
sizes.extend([14]*3)
sizes.extend([16]*3)
sizes=sizes*3
dates=['1/1/2011','1/2/2011','1/3/2011']*27
idx=zip(regions,sizes,dates)
idx=pd.MultiIndex.from_tuples(idx, names=['Region','Size','Date'])
df=pd.DataFrame(np.arange(27), index = idx, columns=['Sales']) 

# Check it
df

Out[1]: 
                       Sales
Region  Size Date           
NorthAm 12   1/1/2011      0
             1/2/2011      1
             1/3/2011      2
        14   1/1/2011      3
             1/2/2011      4
             1/3/2011      5
        16   1/1/2011      6
             1/2/2011      7
             1/3/2011      8
APAC    12   1/1/2011      9
             1/2/2011     10
             1/3/2011     11
        14   1/1/2011     12
             1/2/2011     13
             1/3/2011     14
        16   1/1/2011     15
             1/2/2011     16
             1/3/2011     17
Eur     12   1/1/2011     18
             1/2/2011     19
             1/3/2011     20
        14   1/1/2011     21
             1/2/2011     22
             1/3/2011     23
        16   1/1/2011     24
             1/2/2011     25
             1/3/2011     26

# Yes, that's right. Now create the cumulative sum, regardless of region
cs=df.groupby(level=['Size','Date']).sum().groupby(level=0).cumsum()

# Check it.
cs

Out[1]: 
               Sales
Size Date           
12   1/1/2011     27
     1/2/2011     57
     1/3/2011     90
14   1/1/2011     36
     1/2/2011     75
     1/3/2011    117
16   1/1/2011     45
     1/2/2011     93
     1/3/2011    144

Now I'd like to do something like:

df['WWSales']=cd.reindex(df, method='???')

to get something like:

Out[2]:
                       Sales    WWSales
Region  Size Date           
NorthAm 12   1/1/2011      0    27
             1/2/2011      1    57
             1/3/2011      2    90
        14   1/1/2011      3    36
             1/2/2011      4    75
             1/3/2011      5   115
        16   1/1/2011      6    45
             1/2/2011      7    93
             1/3/2011      8   144
APAC    12   1/1/2011      9    27
             1/2/2011     10    57
             1/3/2011     11    90
        14   1/1/2011     12    36
             1/2/2011     13    75
             1/3/2011     14   115
        16   1/1/2011     15    45
             1/2/2011     16    93
             1/3/2011     17   144
Eur     12   1/1/2011     18    27
             1/2/2011     19    57
             1/3/2011     20    90
        14   1/1/2011     21    36
             1/2/2011     22    75
             1/3/2011     23   115
        16   1/1/2011     24    45
             1/2/2011     25    93
             1/3/2011     26   144

I suspect there's some really obvious solution, but I'm not seeing it (and not finding it in searches). Any help would be greatly appreciated.

By the way, bonus points for an elegant way to make the date column into a DateTimeIndex. The following works, but seems inelegant:

df.index.levels[2]=pd.tseries.period.DatetimeIndex(df.index.levels[2])
È stato utile?

Soluzione

Instead of (implicitly) applying sum and cumsum, use the transform method on each group. That is:

df['WWSales'] = df.groupby(level=['Size','Date']).transform(np.sum).groupby(level=0).transform(np.cumsum)

That gives your desired output.

See http://pandas.pydata.org/pandas-docs/dev/groupby.html#transformation

I'll take a crack at your bonus question: you could map Timestamp onto your list.

 dates = map(pd.Timestamp, ['1/1/2011','1/2/2011','1/3/2011']*27)

But I expect there is a neater way -- I cede that question to more experienced voices....

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top