Question

I want to calculate a statistic over all pairwise combinations of the columns of a very large matrix. I have a python script, called jaccard.py that accepts a pair of columns and computes this statistic over the matrix.

On my work machine, each calculation takes about 10 seconds, and I have about 95000 of these calculations to complete. However, all these calculations are independent from one another and I am looking to use a cluster we have that uses the Torque queueing system and python2.4. What's the best way to parallelize this calculation so it's compatible with Torque?

I have made the calculations themselves compatible with python2.4, but I am at a loss how to parallelize these calculations using subprocess, or whether I can even do that because of the GIL.

The main idea I have is to keep a constant pool of subprocesses going; when one finishes, read the output and start a new one with the next pair of columns. I only need the output once the calculation is finished, then the process can be restarted on a new calculation.

My idea was to submit the job this way

qsub -l nodes=4:ppn=8 myjob.sh > outfile

myjob.sh would invoke a main python file that looks like the following:

import os, sys
from subprocess import Popen, PIPE
from select import select

def combinations(iterable, r):
    #backport of itertools combinations
    pass

col_pairs = combinations(range(598, 2))

processes = [Popen(['./jaccard.py'] + map(str, col_pairs.next()), 
                   stdout=PIPE)
             for _ in range(8)]

try:
    while 1:
        for p in processes:
            # If process has completed the calculation, print it out
            # **How do I do this part?**

            # Delete the process and add a new one
            p.stdout.close()
            processes.remove(p)
            process.append(Popen(['./jaccard.py'] + map(str, col_pairs.next()),
                                 stdout=Pipe))

# When there are no more column pairs, end the job.
except StopIteration:
    pass

Any advice on to how to best do this? I have never used Torque and am unfamiliar with subprocessing in this way. I tried using multiprocessing.Pool on my workstation and it worked flawlessly with Pool.map, but since the cluster uses python2.4, I'm not sure how to proceed.

EDIT: Actually, on second thought, I could just write multiple qsub scripts, each only working on a single chunk of the 95000 calculations. I could submit something like 16 different jobs, each doing 7125 calculations. It's essentially the same thing.

Was it helpful?

Solution

Actually, on second thought, I could just write multiple qsub scripts, each only working on a single chunk of the 95000 calculations. I could submit something like 16 different jobs, each doing 7125 calculations. It's essentially the same thing. This isn't a solution, but it's a suitable workaround given time and effort constraints.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top