Question

I am trying to implement PCA in Python. Currently I am using this code to reconstruct the data in its original dimensions from the low-dimensional representation and the principal components:

sameDimRepresentation = lowDimRepresentation[:, np.newaxis] * principalComponents.T
sameDimRepresentation = sameDimRepresentation.sum(axis=2)

What the code does:

for each row of lowDimRepresentation, it computes the product of each element of that row (treated as a scalar) with each of the row vectors of principalComponents (i.e. the column vectors of principalComponents.T), and then sums all of these product vectors (line 2)

lowDimRepresentation: an array of x by 100 
principalComponents: an array of 100 by 784

resulting array: x by 784

This method works fine up to x = 10000, but beyond that I get a memory error.
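For context, here is a rough back-of-the-envelope sketch of where the memory goes (the shapes follow the description above; the byte count assumes float64):

# Broadcasting (x, 1, 100) against (784, 100) materializes an intermediate
# array of shape (x, 784, 100) before the sum over axis=2.
x = 60000
intermediate_bytes = x * 784 * 100 * 8   # 8 bytes per float64 element
print(intermediate_bytes / 1e9)          # ~37.6 GB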

I know einsum is more memory efficient; I tried to rewrite the same code with it but did not manage to.

Can someone help me with that?

Worst case, I can just split the 60000 cases into batches of 10000 and run those, but I was hoping for something more elegant.

Thanks a lot!


Solution

So there's good news and there's bad news. The good news is that the einsum version is very simple:

np.einsum('ij,jl->il', lowDimRepresentation, principalComponents)

For example:

>>> import numpy as np
>>> x = 1000
>>> lowDimRepresentation = np.random.random((x, 100))
>>> principalComponents = np.random.random((100, 784))
>>> sameDimRepresentation = (lowDimRepresentation[:, np.newaxis] * principalComponents.T).sum(axis=2)
>>> esum_same = np.einsum('ij,jl->il', lowDimRepresentation, principalComponents)
>>> np.allclose(sameDimRepresentation, esum_same)
True

It should also be quite a bit faster:

>>> %timeit sameDimRepresentation = (lowDimRepresentation[:, np.newaxis] * principalComponents.T).sum(axis=2)
1 loops, best of 3: 1.12 s per loop
>>> %timeit esum_same = np.einsum('ij,jl->il', lowDimRepresentation, principalComponents)
10 loops, best of 3: 88.7 ms per loop

The bad news is that when I try applying it to the x=60000 case:

>>> esum_same = np.einsum('ij,jl->il', lowDimRepresentation, principalComponents)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: iterator is too large

So I'm not sure whether it will actually help with your real problem.
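If einsum hits that limit, the batching fallback you mention should still work; a minimal sketch (the batch size of 10000 and the variable names are just placeholders):

# Process the rows in chunks so no single einsum call covers all 60000 samples.
batch_size = 10000
sameDimRepresentation = np.empty((lowDimRepresentation.shape[0],
                                  principalComponents.shape[1]))
for start in range(0, lowDimRepresentation.shape[0], batch_size):
    stop = start + batch_size
    sameDimRepresentation[start:stop] = np.einsum(
        'ij,jl->il', lowDimRepresentation[start:stop], principalComponents)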
