Question

This is a question about how to perform a very large number of table joins for the purpose of doing some vector math in Pandas.

So, through a VERY, VERY long processing chain, I have boiled a huge amount of data represented as HDF5 tables down into a set of about 20 sparse vectors, represented as Pandas DataFrames with string-based MultiIndexes. The space in which these vectors reside is very complicated and high-dimensional (it's natural language data), but the vectors overlap somewhat. The dimensions themselves have a hierarchy (hence the MultiIndex). Each vector has about 5K-60K dimensions, and the combined number of dimensions across the 20 (which can differ depending on which 20 I call up) is about 200K. (The FULL space has FAR more than 200K dimensions in it!)

Up to here it's very fast, with a one-time cost of processing the tables into the right kind of vectors.

But now I want to align and sum these vectors. All of the solutions I've found are rather slow. I am using Pandas 0.12.0 on Python 2.7.

Let A be the store/on-disk hash from which I am getting the vectors.

In [106]: nounlist = ["fish-n", "bird-n", "ship-n", "terror-n", "daughter-n", "harm-n", "growth-n", "reception-n", "antenna-n", "bank-n", "friend-n", "city-n", "woman-n", "weapon-n", "politician-n", "money-n", "greed-n", "law-n", "sympathy-n", "wound-n"]

In [107]: matrices = [A[x] for x in nounlist]

(The name matrices is a bit misleading, I recognize after the fact: aside from the MultiIndex, each is a single column.)

So far so good. But now I want to join them so that I can sum them:

In [108]: %timeit matrices[0].join(matrices[1:], how="outer")
1 loops, best of 3: 18.2 s per loop

This is on a relatively recent processor (2.7 GHz AMD Opteron). It's far too slow for something that ideally would be used (at high dimensionality) in a speech-processing system.

I get a bit better luck with reduce:

In [109]: %timeit reduce(lambda x, y: x.join(y, how="outer"), matrices[1:], matrices[0])
1 loops, best of 3: 10.8 s per loop

These stay pretty consistent across runs. Once it returns, the summing is at an acceptable speed:

In [112]: vec = reduce(lambda x, y: x.join(y, how="outer"), matrices[1:], matrices[0])

In [113]: %timeit vec.T.sum()
1 loops, best of 3: 262 ms per loop

The closest I've come to getting it down to a reasonable time is this:

def dictcutter(mlist):
    # Turn each single-column frame into a plain dict keyed by its
    # (link, word1) index tuples.
    rlist = [x.to_dict()[x.columns[0]] for x in mlist]
    # Accumulate the values dimension by dimension.
    mdict = {}
    for r in rlist:
        for item in r:
            mdict[item] = mdict.get(item, 0.0) + r[item]
    # Rebuild a MultiIndexed single-column frame from the merged dict
    # (keys() and values() line up as long as the dict isn't modified in between).
    index = pd.MultiIndex.from_tuples(mdict.keys())
    return pd.DataFrame(mdict.values(), index=index)

This runs like:

In [114]: %timeit dictcutter(matrices)
1 loops, best of 3: 3.13 s per loop

But every second counts! Is there a way to cut it down even further? Is there a smarter way to add these vectors by dimension?

EDITED TO ADD details requested by Jeff in comments:

Some details about the vector for "fish-n":

In [14]: vector = A['fish-n']

In [15]: vector.head()
Out[15]: 
                   fish-n
link   word1             
A2     give-v  140.954675
A4     go-v    256.313976
AM-CAU go-v      0.916041
AM-DIR go-v     29.022072
AM-MNR go-v     21.941577

In [16]: vector.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 5424 entries, (A2, give-v) to (A1, gotta-v)
Data columns (total 1 columns):
fish-n    5424  non-null values
dtypes: float64(1)

Drilling deeper:

In [17]: vector.loc['A0']
Out[17]: 
<class 'pandas.core.frame.DataFrame'>
Index: 1058 entries, isolate-v to overdo-v
Data columns (total 1 columns):
fish-n    1058  non-null values
dtypes: float64(1)

In [18]: vector.loc['A0'][500:520]
Out[18]: 
                 fish-n
word1                  
whip-v         3.907307
fake-v         0.117985
sip-v          0.579624
impregnate-v   0.885079
flavor-v       5.583664
inspire-v      2.251709
pepper-v       0.967941
overrun-v      1.435597
clutch-v       0.140110
intercept-v   20.513823
refined-v      0.738980
gut-v          7.570856
ascend-v      12.686698
submerge-v     1.761342
catapult-v     0.577075
cleaning-v     1.492284
floating-v     5.318519
incline-v      2.270102
plummet-v      0.243116
propel-v       3.957041

Now multiply that by 20 and try and sum them all...


Solution

Create some test data

In [66]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....: 

In [67]: mi_total = pd.MultiIndex.from_product([mklbl('A',1000),mklbl('B',200)])

# note that these are random consecutive slices; that's just for illustration
In [68]: ms = [ pd.Series(1,index=mi_total.take(np.arange(50000)+np.random.randint(0,150000,size=1))) for i in range(20) ]

In [69]: ms[0]
Out[69]: 
A417  B112    1
      B113    1
      B114    1
      B115    1
      B116    1
      B117    1
      B118    1
      B119    1
      B120    1
      B121    1
      B122    1
      B123    1
      B124    1
      B125    1
      B126    1
...
A667  B97     1
      B98     1
      B99     1
      B100    1
      B101    1
      B102    1
      B103    1
      B104    1
      B105    1
      B106    1
      B107    1
      B108    1
      B109    1
      B110    1
      B111    1
Length: 50000, dtype: int64

Shove everything into one really long Series, convert it to a frame (with the same index, which contains duplicate entries at this point), then sum on the index levels (which de-duplicates).

This is equivalent to concat(ms).groupby(level=[0,1]).sum(). (The sort at the end is just for illustration and not necessary.) Though you probably want to sortlevel() to sort the index if you are doing any kind of indexing afterwards.

In [103]: concat(ms).to_frame(name='value').sum(level=[0,1]).sort('value',ascending=False)
Out[103]: 
           value
A596 B109     14
A598 B120     14
     B108     14
     B109     14
     B11      14
     B110     14
     B111     14
     B112     14
     B113     14
     B114     14
     B115     14
     B116     14
     B117     14
     B118     14
     B119     14
     B12      14
     B121     14
     B106     14
     B122     14
     B123     14
     B124     14
     B125     14
     B126     14
     B127     14
     B128     14
     B129     14
     B13      14
     B130     14
     B131     14
     B132     14
     B133     14
     B134     14
     B107     14
     B105     14
     B136     14
A597 B91      14
     B79      14
     B8       14
     B80      14
     B81      14
     B82      14
     B83      14
     B84      14
     B85      14
     B86      14
     B87      14
     B88      14
     B89      14
     B9       14
     B90      14
     B92      14
A598 B104     14
A597 B93      14
     B94      14
     B95      14
     B96      14
     B97      14
     B98      14
     B99      14
A598 B0       14
             ...

[180558 rows x 1 columns]

Pretty fast now

In [104]: %timeit concat(ms).to_frame(name='value').sum(level=[0,1]).sort('value',ascending=False)
1 loops, best of 3: 342 ms per loop
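
Applied to the frames from the question, the same idea would look roughly like the sketch below. This is a hypothetical, untested adaptation: it assumes the store A and nounlist from above, and it pulls the single data column out of each frame as a Series before concatenating, since the column names differ per noun.

import pandas as pd

# Each stored frame has one column named after the noun; grab it as a
# Series so the differing column names don't matter when concatenating.
series_list = [A[noun][noun] for noun in nounlist]

# Stack the ~20 Series end to end (duplicate index entries are fine here),
# then sum within each (link, word1) pair, which de-duplicates the index.
vec = pd.concat(series_list).groupby(level=[0, 1]).sum()

# Sort the index if you plan to do any label-based indexing afterwards.
vec = vec.sortlevel()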