Question

This is a question about how to perform a very large number of table joins for the purpose of doing some vector math in Pandas.

So, through a VERY, VERY long processing chain, I have boiled a huge amount of data represented as HDF5 tables down into a set of about 20 sparse vectors, represented as Pandas DataFrames with string-based MultiIndexes. The space in which these vectors reside is very complicated and high-dimensional (it's natural language data), but the vectors overlap somewhat. The dimensions themselves have a hierarchy (hence the MultiIndex). Each vector has about 5K-60K dimensions, and the combined number of dimensions across the 20 (which can differ depending on which 20 I call up) is about 200K. (The FULL space has FAR more than 200K dimensions in it!)

Up to here it's very fast, with a one-time cost of processing the tables into the right kind of vectors.

But now I want to align and sum these vectors. All of the solutions I've found are rather slow. I am using Pandas 0.12.0 on Python 2.7.

Let A be the store/on-disk hash from which I am getting the vectors.

In [106]: nounlist = ["fish-n", "bird-n", "ship-n", "terror-n", "daughter-n", "harm-n", "growth-n", "reception-n", "antenna-n", "bank-n", "friend-n", "city-n", "woman-n", "weapon-n", "politician-n", "money-n", "greed-n", "law-n", "sympathy-n", "wound-n"]

In [107]: matrices = [A[x] for x in nounlist]

(The name matrices is a bit misleading, I recognize after the fact: aside from the MultiIndex, each is a single column.)

So far so good. But now I want to join them so that I can sum them:

In [108]: %timeit matrices[0].join(matrices[1:], how="outer")
1 loops, best of 3: 18.2 s per loop

This is on a relatively recent processor (2.7 GHz AMD Opteron). It's far too slow for something that ideally would be used (at high dimensionality) in a speech-processing system.

I get a bit better luck with reduce:

In [109]: %timeit reduce(lambda x, y: x.join(y, how="outer"), matrices[1:], matrices[0])
1 loops, best of 3: 10.8 s per loop

These stay pretty consistent across runs. Once it returns, the summing is at an acceptable speed:

In [112]: vec = reduce(lambda x, y: x.join(y, how="outer"), matrices[1:], matrices[0])

In [113]: %timeit vec.T.sum()
1 loops, best of 3: 262 ms per loop

The closest I've come to getting it down to a reasonable time is this:

def dictcutter(mlist):
    # Turn each single-column frame into a plain dict keyed by its
    # (link, word1) index tuples.
    rlist = [x.to_dict()[x.columns[0]] for x in mlist]
    # Accumulate the values dimension by dimension.
    mdict = {}
    for r in rlist:
        for item in r:
            mdict[item] = mdict.get(item, 0.0) + r[item]
    # Rebuild a MultiIndexed single-column frame from the merged dict
    # (keys() and values() line up as long as the dict isn't modified in between).
    index = pd.MultiIndex.from_tuples(mdict.keys())
    return pd.DataFrame(mdict.values(), index=index)

This runs like:

In [114]: %timeit dictcutter(matrices)
1 loops, best of 3: 3.13 s per loop

But every second counts! Is there a way to cut it down even further? Is there a smarter way to add these vectors by dimension?

EDITED TO ADD details requested by Jeff in comments:

Some details about the vector for "fish-n":

In [14]: vector = A['fish-n']

In [15]: vector.head()
Out[15]: 
                   fish-n
link   word1             
A2     give-v  140.954675
A4     go-v    256.313976
AM-CAU go-v      0.916041
AM-DIR go-v     29.022072
AM-MNR go-v     21.941577

In [16]: vector.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 5424 entries, (A2, give-v) to (A1, gotta-v)
Data columns (total 1 columns):
fish-n    5424  non-null values
dtypes: float64(1)

Drilling deeper:

In [17]: vector.loc['A0']
Out[17]: 
<class 'pandas.core.frame.DataFrame'>
Index: 1058 entries, isolate-v to overdo-v
Data columns (total 1 columns):
fish-n    1058  non-null values
dtypes: float64(1)

In [18]: vector.loc['A0'][500:520]
Out[18]: 
                 fish-n
word1                  
whip-v         3.907307
fake-v         0.117985
sip-v          0.579624
impregnate-v   0.885079
flavor-v       5.583664
inspire-v      2.251709
pepper-v       0.967941
overrun-v      1.435597
clutch-v       0.140110
intercept-v   20.513823
refined-v      0.738980
gut-v          7.570856
ascend-v      12.686698
submerge-v     1.761342
catapult-v     0.577075
cleaning-v     1.492284
floating-v     5.318519
incline-v      2.270102
plummet-v      0.243116
propel-v       3.957041

Now multiply that by 20 and try and sum them all...


Solution

Create some test data

In [66]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....: 

In [67]: mi_total = pd.MultiIndex.from_product([mklbl('A',1000),mklbl('B',200)])

# note that these are random consecutive slices; that's just for illustration
In [68]: ms = [ pd.Series(1,index=mi_total.take(np.arange(50000)+np.random.randint(0,150000,size=1))) for i in range(20) ]

In [69]: ms[0]
Out[69]: 
A417  B112    1
      B113    1
      B114    1
      B115    1
      B116    1
      B117    1
      B118    1
      B119    1
      B120    1
      B121    1
      B122    1
      B123    1
      B124    1
      B125    1
      B126    1
...
A667  B97     1
      B98     1
      B99     1
      B100    1
      B101    1
      B102    1
      B103    1
      B104    1
      B105    1
      B106    1
      B107    1
      B108    1
      B109    1
      B110    1
      B111    1
Length: 50000, dtype: int64

Shove everything into one really long Series, convert it to a frame (with the same index, which contains duplicate entries at this point), then sum on the index levels (which de-duplicates).

This is equivalent to concat(ms).groupby(level=[0,1]).sum(). (The sort at the end is just for illustration and not necessary.) Though you probably want to sortlevel() to sort the index if you are doing any kind of indexing afterwards.

In [103]: concat(ms).to_frame(name='value').sum(level=[0,1]).sort('value',ascending=False)
Out[103]: 
           value
A596 B109     14
A598 B120     14
     B108     14
     B109     14
     B11      14
     B110     14
     B111     14
     B112     14
     B113     14
     B114     14
     B115     14
     B116     14
     B117     14
     B118     14
     B119     14
     B12      14
     B121     14
     B106     14
     B122     14
     B123     14
     B124     14
     B125     14
     B126     14
     B127     14
     B128     14
     B129     14
     B13      14
     B130     14
     B131     14
     B132     14
     B133     14
     B134     14
     B107     14
     B105     14
     B136     14
A597 B91      14
     B79      14
     B8       14
     B80      14
     B81      14
     B82      14
     B83      14
     B84      14
     B85      14
     B86      14
     B87      14
     B88      14
     B89      14
     B9       14
     B90      14
     B92      14
A598 B104     14
A597 B93      14
     B94      14
     B95      14
     B96      14
     B97      14
     B98      14
     B99      14
A598 B0       14
             ...

[180558 rows x 1 columns]

Pretty fast now

In [104]: %timeit concat(ms).to_frame(name='value').sum(level=[0,1]).sort('value',ascending=False)
1 loops, best of 3: 342 ms per loop
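
Applied to the frames from the question, the same idea would look roughly like the sketch below. This is a hypothetical, untested adaptation: it assumes the store A and nounlist from above, and it pulls the single data column out of each frame as a Series before concatenating, since the column names differ per noun.

import pandas as pd

# Each stored frame has one column named after the noun; grab it as a
# Series so the differing column names don't matter when concatenating.
series_list = [A[noun][noun] for noun in nounlist]

# Stack the ~20 Series end to end (duplicate index entries are fine here),
# then sum within each (link, word1) pair, which de-duplicates the index.
vec = pd.concat(series_list).groupby(level=[0, 1]).sum()

# Sort the index if you plan to do any label-based indexing afterwards.
vec = vec.sortlevel()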