Question

I am trying to optimize some python code (to speed up some matrix operations), my code is something similar to this one (my real dataset is also similar to 'gps'),

import numpy as np
gps = [np.random.rand(50,50) for i in xrange(1000)]
ips = np.zeros( (len(gps),len(gps)), dtype='float32')

for i in xrange(len(gps)):
  for j in xrange(0,i+1):
    ips[i,j]= f.innerProd(gps[i],gps[j])
    ips[j,i]= ips[i,j]
   print "Inner product matrix: %3.0f %% done (%d of %d)"%  \
               (((i+1)**2.)/(len(gps)**2.)*100, i, len(gps))

def innerProd(mat1,mat2):
    return float(np.sum(np.dot(np.dot(mat1,mat2),mat1)))

What I would like to understand is , why is it that the program begins running fast during the first iterations and then slows down as it iterates further? I know the question might be a bit naive but I really want to have a clearer idea of what is happening before I attempt anything else. I already implemented my function in Fortran (leaving within the Fortran realm any for loops) and used f2py to create a dynamic lib to call the function from python, this would be the new code in python..

import numpy as np
import myfortranInnProd as fip

gps = [np.random.rand(50,50) for i in xrange(1000)]
ips = np.zeros( (len(gps),len(gps)), dtype='float32')

ips = fip.innerProd(gps)

unfortunately I only found out (surprisingly) that my fortran-python version runs 1.5 ~ 2 times slower than the first version (it is important to mention that I used MATMUL() on the Fortran implementation). I have been googling around for a while and I believe that this "slow down" has something to do with the memory bandwidth, memory allocation or caching, given the large datasets, but I am not very sure about what is really happening behind and how could I improve the performance. I have run the code on both a small intel atom , 2GB ram and a 4 core intel xeon, with 8GB (of course with a correspondingly scaled dataset) and the "slow down" behavior is the same.

I just need to understand why is it that this 'slow down' happens? would it do any good if i implement the function in C ? or try to implement it to run on a GPU ? Any other ideas how to improve it? Thanks in advance

Was it helpful?

Solution

At the risk of stating the obvious, the number of executions of the inner loop will grow each time you complete an execution of the outer loop. When i is 0, the inner loop will only be executed once, but when i is 100, it will be executed 101 times. Could this explain your observations, or do you mean that each execution of the inner loop itself is getting slower over time?

OTHER TIPS

The number of executions of the inner for loop depends on the value of i, the index of the outer for loop. Since you're displaying your debug each time the inner loop finishes, it gets displayed less and less often as i grows. (Note that the percentage increases regularly, however.)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top