So I have this dataset containing sales data for N items between day d1 and day d2. For each purchase, I have the timestamp, the customer ID and the item ID. My goal is to generate a dataframe of size (M x N), where df[i, j] is the total number of purchases for item j during month i.
Generate mockup data
import random
import pandas as pd

d1 = '2014-1-1'
d2 = '2014-3-31'
daily = pd.date_range(d1, d2, freq='D')
npurchase = 1000
nitem = 20
# One row per purchase: a random day in [d1, d2] and a random item ID in [0, nitem)
olddf = pd.DataFrame({
    'dt': [random.choice(daily) for _ in range(npurchase)],
    'itemID': [random.randrange(nitem) for _ in range(npurchase)],
})
olddf.head()
Output:
                    dt  itemID
0  2014-02-24 00:00:00      19
1  2014-01-29 00:00:00       0
2  2014-01-27 00:00:00       7
3  2014-02-03 00:00:00      12
4  2014-01-24 00:00:00       3
Resample and align
# Month-end index ('M'; pandas >= 2.2 spells this alias 'ME')
rng = pd.date_range(d1, d2, freq='M')
newdf = pd.DataFrame(index=rng)
for name, group in olddf.groupby('itemID'):
    # Daily purchase counts for this item, summed into monthly totals
    tmp = group.groupby('dt').size().resample('M').sum()
    newdf[name] = tmp
newdf.fillna(0, inplace=True)
newdf.loc[:, :5]
Output:
             0   1   2   3   4   5
2014-01-31  15  21  25  17  10  14
2014-02-28  10  13  16  20  15   8
2014-03-31  12  25  14  14  26  12
Is there a more efficient / elegant way to do it?
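For comparison, here is a sketch of one candidate that avoids the per-item loop: a single pd.crosstab call on the month of each purchase. It is only an illustration under some assumptions: the names month and counts are made up for this example, the seed is fixed just so the mock data is reproducible, and the final reindex is an optional extra so that items with no sales still get a column.

```python
import random

import pandas as pd

random.seed(0)  # assumption: fixed seed, only to make the mock data reproducible

daily = pd.date_range('2014-1-1', '2014-3-31', freq='D')
npurchase, nitem = 1000, 20
olddf = pd.DataFrame({
    'dt': [daily[random.randrange(len(daily))] for _ in range(npurchase)],
    'itemID': [random.randrange(nitem) for _ in range(npurchase)],
})

# Label every purchase with its calendar month (a monthly Period such as 2014-01).
month = olddf['dt'].dt.to_period('M').rename('month')

# Count purchases per (month, item) in one pass: rows are months, columns are
# item IDs, and combinations with no purchases come out as 0.
counts = pd.crosstab(month, olddf['itemID'])

# Optional (hypothetical) step: make sure every item ID has a column,
# even if it never appears in the data.
counts = counts.reindex(columns=range(nitem), fill_value=0)

print(counts.iloc[:, :6])
```

Note that the rows here are labelled by monthly period (2014-01, 2014-02, ...) rather than by month-end date as in newdf above, so the two frames hold the same counts but are indexed differently.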