Question

X is a text file that contains 100,000 bit vectors of equal size (500 elements each), i.e. each row is a vector of 500 elements. I am generating an adjacency matrix (100,000 x 100,000) using the code below, but it's not optimized and very time consuming. How can I improve it?

import numpy as np
import scipy.spatial.distance


 readFrom = "vector.txt"
 fout = open("adjacencymatrix.txt","a")

 X = np.genfromtxt(readFrom, dtype=None) 

 for outer in range(0,100000):
    for inner in range(0,100000):
        dis = scipy.spatial.distance.euclidean(X[outer],X[inner])
        tmp += str(dis)+" "
    tmp += "\n"        
    fout.write(tmp)
 fout.close()

Thank you.

Solution

Edit: Completely rewrote after understanding the question better. Given the size of the data, this one is tricky. The best speedup I've got so far is with the following:

import time
import numpy as np
from scipy import spatial
import multiprocessing as mp

pool = mp.Pool(4)  # set up for the parallel version described below (not used yet)

# Stand-in for the real data: 100,000 random vectors of 500 elements each.
test_data = np.random.random(100000 * 500).reshape([100000, 500])

outfile = open('/tmp/test.out', 'w')

def split(data, size):
    # Yield consecutive chunks of `size` rows.
    for i in xrange(0, len(data), size):
        yield data[i:i + size]

def distance(vecs):
    # Distances of one chunk against the whole dataset (meant for the pool workers).
    return spatial.distance.cdist(vecs, test_data)

chunks = list(split(test_data, 100))
for chunk in chunks:
    t0 = time.time()
    distances = spatial.distance.cdist(chunk, test_data)
    # Write one line of the output matrix per row of the chunk.
    for row in distances:
        outfile.write(' '.join(str(x) for x in row) + '\n')
    print 'estimated: %.2f secs' % ((time.time() - t0) * len(chunks))

So I tried balancing the size of each chunk of the dataset against the memory overhead. This got me down to an estimated 6,600 secs to finish, or ~110 mins. You can see I also started looking at whether I could parallelize it with the multiprocessing pool. My strategy would have been to process each chunk asynchronously, save each one to a different text file, then concatenate the files afterwards, but I had to get back to work.
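
For what it's worth, here is a rough sketch of how that parallel strategy might look. The process_chunk helper, the per-chunk file names and the final concatenation step are my guesses at what is described above, not code from the answer, and it assumes a platform where multiprocessing forks so the workers see the same data array:

import numpy as np
from scipy import spatial
import multiprocessing as mp

# Small stand-in dataset so the sketch runs quickly; the real one is 100,000 x 500.
data = np.random.random((2000, 500))

def split(data, size):
    # Yield consecutive chunks of `size` rows.
    for i in xrange(0, len(data), size):
        yield data[i:i + size]

def process_chunk(args):
    # Worker: distances of one chunk vs. the full dataset, written to its own file.
    idx, chunk = args
    dists = spatial.distance.cdist(chunk, data)
    fname = '/tmp/test.out.%d' % idx
    with open(fname, 'w') as f:
        for row in dists:
            f.write(' '.join(str(x) for x in row) + '\n')
    return fname

if __name__ == '__main__':
    pool = mp.Pool(4)
    chunks = list(split(data, 100))
    fnames = pool.map(process_chunk, list(enumerate(chunks)))
    # Stitch the partial files back together in chunk order.
    with open('/tmp/test.out', 'w') as out:
        for fname in fnames:
            with open(fname) as part:
                out.write(part.read())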

OTHER TIPS

Some small optimizations over your code (I'm assuming you're using Python 2.x):

import numpy as np
import scipy.spatial.distance

X = np.genfromtxt("vector.txt", dtype=None)
fout = open("adjacencymatrix.txt", "a")

for outer in xrange(0, 100000):
    # Build and write each row in one pass instead of accumulating a growing string.
    row = " ".join(str(scipy.spatial.distance.euclidean(X[outer], X[inner]))
                   for inner in xrange(0, 100000))
    fout.write(row + "\n")

fout.close()

I wouldn't recommend precomputing the whole matrix before writing it - doing so would let us exploit the symmetry of the problem and iterate over only half of the elements, but it would consume a lot of memory. So I'm sticking with what you had: each line is written as soon as it is calculated.

The real problem here is that the input data is huge and the distance calculation will be executed 100,000 x 100,000 = 10,000,000,000 times; no amount of micro-optimization will change that. Are you sure you have to calculate the whole matrix?
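
If a condensed half-matrix would do, one way to exploit the symmetry is scipy.spatial.distance.pdist, which computes each unordered pair exactly once (roughly 5 billion values, about 40 GB, instead of 74.5 GB for the full square). A minimal sketch on a small random sample:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Small random 0/1 sample just for illustration; the real data is 100,000 x 500.
sample = np.random.randint(0, 2, (1000, 500))

condensed = pdist(sample, 'euclidean')  # one value per unordered pair (i < j)
print(condensed.shape)                  # (499500,) == 1000 * 999 / 2

full = squareform(condensed)            # expand to the symmetric square form if needed
print(full.shape)                       # (1000, 1000)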

(If you're using Python 2.x, use xrange instead of range.)

To compute the whole thing in one vectorized shot, you could use broadcasting:

diff_matrix = X[:, numpy.newaxis, :] - X[numpy.newaxis, :, :]  # shape (N, N, 500)
result = numpy.sqrt((diff_matrix ** 2).sum(axis=-1))           # pairwise Euclidean distances
# output the result. (NB: the intermediate array holds N * N * 500 doubles,
# so in practice this only works on chunks or subsets.)

Note that to store a 100,000 × 100,000 matrix of doubles you'll need 74.5 GB of memory, and maybe double that for the file size of the text output. Do you really need the whole matrix? (You may also parallelize the computation, but that would need more than numpy.)
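
For reference, that figure is just the element count times eight bytes:

n = 100000
print(n * n * 8 / float(2 ** 30))  # ~74.5 GiB for an n x n matrix of 8-byte doubles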

I have a hunch that the distance matrix might be calculated without explicit Python loops by using matrix operations.

The outer product of X with its transpose seems promising: it computes the inner product of each pair of vectors and leaves the result in each cell of the resulting 100,000 × 100,000 matrix, and the inner product is closely related to the Euclidean distance (or its square).

So I guess it's a matter of tweaking that to get the Euclidean distance between the two vectors rather than the inner product. My instinct tells me that complex numbers might be useful here.

Maybe some brighter mind could shed some light on this.
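
To make that hunch concrete: since ||a - b||^2 = a.a + b.b - 2 a.b, the squared distances can be assembled from the Gram matrix X Xᵀ with no explicit Python loop (and no complex numbers needed). A minimal sketch on a small random sample:

import numpy as np
from scipy.spatial.distance import cdist

# Small 0/1 sample for illustration; the same algebra applies to the full dataset.
sample = np.random.randint(0, 2, (1000, 500)).astype(float)

gram = sample.dot(sample.T)                   # inner product of every pair of rows
sq_norms = np.diag(gram)                      # ||x_i||^2 for each row
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * gram
dists = np.sqrt(np.maximum(sq_dists, 0))      # clamp tiny negative round-off errors

print(np.allclose(dists, cdist(sample, sample)))  # True, up to floating-point noise

For the full dataset the result is, of course, still the same 74.5 GB matrix, so chunking the computation as in the accepted answer remains the practical route.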

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow