Trying to speed up python code by replacing loops with functions



I am trying to come up with a faster way of writing this code. Here is the part of my program I am trying to speed up, hopefully by using more built-in functions:

num = 0
num1 = 0
rand1 = rand_pos[0:10]
time1 = time.clock() 
for rand in rand1:   
     for gal in gal_pos:
         num1 = dist(gal, rand)
         num = num + num1 
time2 = time.clock()
time_elap = time2-time1
print time_elap

Here, rand_pos and gal_pos are lists of length 900 and 1 million respectively, and dist is a function that calculates the distance between two points in Euclidean space. I used a ten-element slice of rand_pos to get a time measurement. It comes to about 125 seconds, which is far too long: running the code over all of rand_pos would take about three hours. Is there a faster way I can do this?

Here is the dist function:

def dist(pos1, pos2):
    # radius is a global defined elsewhere in the program
    n = 0
    dist_x = pos1[0] - pos2[0]
    dist_y = pos1[1] - pos2[1]
    dist_z = pos1[2] - pos2[2]
    if dist_x < radius and dist_y < radius and dist_z < radius:
        positions = [pos1, pos2]
        distance = scipy.spatial.distance.pdist(positions, metric='euclidean')
        if distance < radius:
            n = 1
    return n



While most of the optimization probably needs to happen within your dist function, there are some tips here to speed things up:

# Don't sum manually
for rand in rand1:
    num += sum(dist(gal, rand) for gal in gal_pos)

# If you can vectorize something, then do so
import numpy as np
new_dist = np.vectorize(dist)
for rand in rand1:
    num += np.sum(new_dist(gal_pos, rand))

# Use already-built code whenever possible (as already suggested)
scipy.spatial.distance.cdist(gal_pos, rand1, metric='euclidean')
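One caveat on the vectorize tip: np.vectorize on its own passes scalars to the function, so for a function of 3-D points it needs a core-dimension signature. A minimal sketch of what that looks like (the pair_dist helper and the sample arrays here are made up for illustration, not from the question):

```python
import numpy as np

# Hypothetical stand-in for dist(): plain Euclidean distance between two points
def pair_dist(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))

# The signature tells vectorize to treat the last axis as a whole point
vec_dist = np.vectorize(pair_dist, signature='(n),(n)->()')

gal_pos = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
rand = np.array([0.0, 0.0, 1.0])

print(vec_dist(gal_pos, rand))  # one distance per galaxy position
```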

There is a function in scipy that does exactly what you want to do here:

scipy.spatial.distance.cdist(gal_pos, rand1, metric='euclidean')

It will probably be faster than anything you can write in pure Python, since the heavy lifting (looping over the pairwise combinations between the arrays) is implemented in C.
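A rough sketch of how this replaces the nested loop, assuming gal_pos and rand_pos are converted to (N, 3) NumPy arrays and radius is defined as in the question (the sizes here are scaled down for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
gal_pos = rng.random((1000, 3))   # stand-in for the million galaxy positions
rand_pos = rng.random((10, 3))    # stand-in for rand1
radius = 0.1

# One call computes every pairwise distance at once; result has shape (1000, 10)
d = cdist(gal_pos, rand_pos, metric='euclidean')

# dist() in the question returns 1 for pairs closer than radius,
# so the total accumulated in num is just a count
num = int(np.count_nonzero(d < radius))
```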

Currently your loop runs in Python, which means there is more overhead per iteration, and on top of that you are making many separate calls to pdist. Even though pdist itself is well optimized, the overhead of making so many calls slows down your code. This type of performance issue was once described to me with a very useful analogy: it's like trying to have a conversation with someone over the phone by saying one word per phone call. Even though each word goes across the line very quickly, the conversation takes a long time because you have to hang up and dial again repeatedly.
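The effect of batching is easy to see with a small timing sketch (the array sizes here are illustrative, not the question's):

```python
import time
import numpy as np
from scipy.spatial.distance import cdist, pdist

rng = np.random.default_rng(0)
a = rng.random((500, 3))
b = rng.random((20, 3))

# Many small calls: one pdist call per pair of points
t0 = time.perf_counter()
s_many = sum(float(pdist(np.array([p, q]))) for p in a for q in b)
t_many = time.perf_counter() - t0

# One batched call computing all pairs at once
t0 = time.perf_counter()
s_batched = float(cdist(a, b).sum())
t_batched = time.perf_counter() - t0

print(t_many, t_batched)  # the batched call is typically far faster
```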

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow