“embarrassingly parallel” programming using python and PBS on a cluster

https://stackoverflow.com/questions/3307948

26-09-2020
|

Question

I have a function (neural network model) which produces figures. I wish to test several parameters, methods and different inputs (meaning hundreds of runs of the function) from python using PBS on a standard cluster with Torque.

Note: I tried parallelpython, ipython and such and was never completely satisfied, since I want something simpler. The cluster is in a given configuration that I cannot change and such a solution integrating python + qsub will certainly benefit to the community.

To simplify things, I have a simple function such as:

import myModule
def model(input, a= 1., N=100):
    do_lots_number_crunching(input, a,N)
    pylab.savefig('figure_' + input.name + '_' + str(a) + '_' + str(N) + '.png')

where input is an object representing the input, input.name is a string, anddo_lots_number_crunching may last hours.

My question is: is there a correct way to transform something like a scan of parameters such as

for a in pylab.linspace(0., 1., 100):
    model(input, a)

into "something" that would launch a PBS script for every call to the model function?

#PBS -l ncpus=1
#PBS -l mem=i1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py

I was thinking of a function that would include the PBS template and call it from the python script, but could not yet figure it out (decorator?).

Solution

pbs_python[1] could work for this. If experiment_model.py 'a' as an argument you could do

import pbs, os

server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)

attopl = pbs.new_attropl(4)
attropl[0].name  = pbs.ATTR_l
attropl[0].resource = 'ncpus'
attropl[0].value = '1'

attropl[1].name  = pbs.ATTR_l
attropl[1].resource = 'mem'
attropl[1].value = 'i1000mb'

attropl[2].name  = pbs.ATTR_l
attropl[2].resource = 'cput'
attropl[2].value = '24:00:00'

attrop1[3].name = pbs.ATTR_V

script='''
cd /data/work/
python experiment_model.py %f
'''

jobs = []

for a in pylab.linspace(0.,1.,100):
    script_name = 'experiment_model.job' + str(a)
    with open(script_name,'w') as scriptf:
        scriptf.write(script % a)
    job_id = pbs.pbs_submit(c, attropl, script_name, 'NULL', 'NULL')
    jobs.append(job_id)
    os.remove(script_name)

 print jobs

[1]: https://oss.trac.surfsara.nl/pbs_python/wiki/TorqueUsage pbs_python

OTHER TIPS

You can do this easily using jug (which I developed for a similar setup).

You'd write in file (e.g., model.py):

@TaskGenerator
def model(param1, param2):
     res = complex_computation(param1, param2)
     pyplot.coolgraph(res)


for param1 in np.linspace(0, 1.,100):
    for param2 in xrange(2000):
        model(param1, param2)

And that's it!

Now you can launch "jug jobs" on your queue: jug execute model.py and this will parallelise automatically. What happens is that each job will in, a loop, do something like:

while not all_done():
    for t in tasks in tasks_that_i_can_run():
        if t.lock_for_me(): t.run()

(It's actually more complicated than that, but you get the point).

It uses the filesystem for locking (if you're on an NFS system) or a redis server if you prefer. It can also handle dependencies between tasks.

This is not exactly what you asked for, but I believe it's a cleaner architechture to separate this from the job queueing system.

It looks like I'm a little late to the party, but I also had the same question of how to map embarrassingly parallel problems onto a cluster in python a few years ago and wrote my own solution. I recently uploaded it to github here: https://github.com/plediii/pbs_util

To write your program with pbs_util, I would first create a pbs_util.ini in the working directory containing

[PBSUTIL]
numnodes=1
numprocs=1
mem=i1000mb
walltime=24:00:00

Then a python script like this

import pbs_util.pbs_map as ppm

import pylab
import myModule

class ModelWorker(ppm.Worker):

    def __init__(self, input, N):
        self.input = input
        self.N = N

    def __call__(self, a):
        myModule.do_lots_number_crunching(self.input, a, self.N)
        pylab.savefig('figure_' + self.input.name + '_' + str(a) + '_' + str(self.N) + '.png')



# You need  "main" protection like this since pbs_map will import this file on the     compute nodes
if __name__ == "__main__":
    input, N = something, picklable
    # Use list to force the iterator
    list(ppm.pbs_map(ModelWorker, pylab.linspace(0., 1., 100),
                     startup_args=(input, N),
                     num_clients=100))

And that would do it.

I just started working with clusters and EP applications. My goal (I'm with the Library) is to learn enough to help other researchers on campus access HPC with EP applications...especially researchers outside of STEM. I'm still very new, but thought it may help this question to point out the use of GNU Parallel in a PBS script to launch basic python scripts with varying arguments. In the .pbs file, there are two lines to point out:

module load gnu-parallel # this is required on my environment

parallel -j 4 --env PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
--workdir $NODE_LOCAL_DIR --transfer --return 'output.{#}' --clean \
`pwd`/simple.py '{#}' '{}' ::: $INPUT_DIR/input.*

# `-j 4` is the number of processors to use per node, will be cluster-specific
# {#} will substitute the process number into the string
# `pwd`/simple.py `{#}` `{}`   this is the command that will be run multiple times
# ::: $INPUT_DIR/input.* all of the files in $INPUT_DIR/ that start with 'input.' 
#     will be substituted into the python call as the second(3rd) argument where the
#     `{}` resides.  These can be simple text files that you use in your 'simple.py'
#     script to pass the parameter sets, filenames, etc.

As a newby to EP supercomputing, even though I don't yet understand all the other options on "parallel", this command allowed me to launch python scripts in parallel with different parameters. This would work well if you can generate a slew of parameter files ahead of time that will parallelize your problem. For example, running simulations across a parameter space. Or processing many files with the same code.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow