Question

In numpy, what's the most efficient way to compute x.T * x, where x is a large (200,000 x 1000) dense float32 matrix and .T is the transpose operator?

For the avoidance of doubt, the result is 1000 x 1000.

edit: In my original question I stated that np.dot(x.T, x) was taking hours. It turned out that some NaNs had snuck into the matrix, and for some reason that was completely killing the performance of np.dot (any insights as to why?). This is now resolved, but the original question stands.
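For anyone hitting the same problem, a minimal sanity check before the multiply might look like this (a sketch using only standard NumPy calls; the random array stands in for the real data):

import numpy as np

x = np.random.rand(200000, 1000).astype(np.float32)  # stand-in for the real matrix

# np.isnan returns a boolean array; .any() collapses it to one flag,
# so this catches stray NaNs before the expensive multiply.
if np.isnan(x).any():
    # One common fix: np.nan_to_num replaces NaNs with zeros
    # (and clamps infinities) before taking the product.
    x = np.nan_to_num(x)

result = np.dot(x.T, x)  # 1000 x 1000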


Solution

This may not be the answer you're looking for, but one way to speed it up considerably is to use a GPU instead of your CPU. If you have a decently powerful graphics card around, it will outperform your CPU any day, even if your system is very well tuned.

For nice integration with numpy, you could use Theano (if your graphics card is made by NVIDIA). The calculation in the following code runs for me in a couple of seconds (although I have a very powerful graphics card):

$ THEANO_FLAGS=device=gpu0 python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import theano
Using gpu device 0: GeForce GTX 480
>>> from theano import tensor as T
>>> import numpy
>>> x = numpy.ones((200000, 1000), dtype=numpy.float32)
>>> m = T.matrix() 
>>> mTm = T.dot(m.T, m)
>>> f = theano.function([m], mTm)
>>> f(x)
array([[ 200000.,  200000.,  200000., ...,  200000.,  200000.,  200000.],
       [ 200000.,  200000.,  200000., ...,  200000.,  200000.,  200000.],
       [ 200000.,  200000.,  200000., ...,  200000.,  200000.,  200000.],
       ..., 
       [ 200000.,  200000.,  200000., ...,  200000.,  200000.,  200000.],
       [ 200000.,  200000.,  200000., ...,  200000.,  200000.,  200000.],
       [ 200000.,  200000.,  200000., ...,  200000.,  200000.,  200000.]], dtype=float32)
>>> r = f(x)
>>> r.shape
(1000, 1000)

I was going to wait and time numpy.dot(x.T, x) on the CPU by way of comparison, but I got bored...

You can also try PyCUDA, or PyOpenCL if you don't have an NVIDIA graphics card, although I don't know if their numpy support is as straightforward.

OTHER TIPS

First, make sure you use an optimized BLAS/LAPACK; this can make a tremendous difference (up to an order of magnitude). If you use a threaded ATLAS, for example, it will use all your cores relatively efficiently (you need a recent ATLAS, though, and compiling ATLAS is a PITA).
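To check which BLAS your NumPy build is actually linked against, NumPy's own introspection call will print it (an optimized backend such as ATLAS, OpenBLAS, or MKL should appear here rather than the plain reference BLAS):

import numpy as np

# Prints the BLAS/LAPACK libraries NumPy was compiled against.
np.show_config()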

As for why NaN slows everything down: that's pretty much unavoidable; NaN handling is much slower than "normal" floats at the CPU level: http://www.cygnus-software.com/papers/x86andinfinity.html. It depends on the CPU model, which instruction set you are using, and of course the algorithms/implementation in play.
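If you want to see the effect on your own machine, here is a rough sketch (sizes shrunk so it finishes quickly; how much NaNs hurt depends heavily on the CPU and the BLAS backend, so you may see anything from no slowdown to a large one):

import time
import numpy as np

a = np.random.rand(5000, 500).astype(np.float32)

def time_dot(m):
    # Time a single m.T @ m product.
    t0 = time.time()
    np.dot(m.T, m)
    return time.time() - t0

clean = time_dot(a)

b = a.copy()
b[0, 0] = np.nan  # one NaN propagates into a whole row and column of the result
with_nan = time_dot(b)

print("clean: %.4fs, with NaN: %.4fs" % (clean, with_nan))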

hmm, x is about 800 MB (200,000 x 1000 float32 values at 4 bytes each), and the 1000 x 1000 result is only another 4 MB; still, are you sure you have enough physical memory and that it's not swapping?
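You can confirm the footprint directly from the array; nbytes reports the raw buffer size (a trivial check using the shapes from the question):

import numpy as np

x = np.ones((200000, 1000), dtype=np.float32)

# 200000 * 1000 values at 4 bytes each = 800 MB for x itself.
print(x.nbytes / 1e6, "MB for x")

# The 1000 x 1000 float32 result is comparatively tiny: 4 MB.
print(1000 * 1000 * 4 / 1e6, "MB for x.T @ x")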

other than that, numpy should hand this off to a BLAS routine, and even though the default library numpy uses may be relatively slow, it should work OK at this size.
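Incidentally, since x.T * x is symmetric, BLAS has a dedicated rank-k update routine (SYRK) that does roughly half the flops of a general matrix multiply. A sketch via SciPy's low-level wrapper scipy.linalg.blas.ssyrk (note that SYRK only writes one triangle of the result, so you have to mirror it yourself; whether this actually beats a plain np.dot depends on your BLAS backend):

import numpy as np
from scipy.linalg import blas

x = np.random.rand(200000, 1000).astype(np.float32)

# trans=1 requests the x.T @ x form; by default only the upper
# triangle of the symmetric result is filled in.
c = blas.ssyrk(1.0, x, trans=1)

# Mirror the upper triangle into the lower one for the full matrix.
full = np.triu(c) + np.triu(c, 1).T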

edit

import numpy as npy
import time

def mm_timing():
    print("   n   Gflops/s")
    print("===============")
    m = 1000
    n = 200000
    # Note: rand returns float64; the question's matrix is float32,
    # which is typically around twice as fast with an optimized BLAS.
    a = npy.random.rand(n, m)
    # Each of the m*m output entries is a length-n dot product:
    # n multiplies plus (n - 1) adds, i.e. 2n - 1 flops per entry.
    flops = (2 * float(n) - 1) * float(m)**2
    t1 = time.time()
    c = npy.dot(a.T, a)
    t2 = time.time()
    perf = flops / (t2 - t1) / 1.e9
    print("%4i     %6.3f" % (n, perf))

mm_timing()