Question

I am in a situation where I do some computation in Python, and based on the outcomes I have a list of target files that are candidates to be passed to a second program.

For example, I have 50,000 files which contain ~2000 items each. I want to filter for certain items and call a command-line program to do some calculation on some of those.

This Program #2 can be used via the shell command line, but it also requires a lengthy set of arguments. For performance reasons I would have to run Program #2 on a cluster.

Right now, I am running Program #2 via subprocess.call("...", shell=True), but I'd like to run it via qsub in the future. I don't have much experience with how exactly this could be done in a reasonably efficient manner.

Would it make sense to write temporary qsub job files and submit them via subprocess directly from the Python script? Is there a better, maybe more Pythonic, solution?

Any ideas and suggestions are very welcome!


Solution

It makes perfect sense, although I would go for another solution.

As far as I understand, you have programme #1 that determines which of your 50,000 files need to be processed by programme #2. Both programme #1 and #2 are written in Python. Excellent choice.

Incidentally, I have a Python module that might come in handy: https://gist.github.com/stefanedwards/8841307

If you are running the same qsub system as I am (no idea what ours is called), you cannot pass command-line arguments to the submitted scripts. Instead, any options are supplied via the -v option, which puts them into environment variables, e.g.:

[me@local ~] $ python isprime.py 1
1: True 
[me@local ~] $ head -n 5 isprime.py
#!/usr/bin/python
### This is a python script ...
import os
os.chdir(os.environ.get('PBS_O_WORKDIR','.'))

[me@local ~] $ qsub -v isprime='1 2 3' isprime.py
123456.cluster.control.com
[me@local ~]

Here, isprime.py could handle command line arguments using argparse. Then you just need to check whether the script is running as a submitted job, and then retrieve said arguments from the environment variables (os.environ).
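For illustration, here is a minimal sketch of how isprime.py might do this. The variable name isprime follows the qsub call in the transcript above; using PBS_ENVIRONMENT to detect a submitted job is my assumption and depends on your batch system:

#!/usr/bin/python
# Sketch: take arguments from the command line, or, when running as a
# submitted job, from the environment variable set via qsub -v.
import argparse
import os
import sys

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def get_args():
    if 'PBS_ENVIRONMENT' in os.environ:
        # Submitted job: the arguments arrive as one string in $isprime.
        argv = os.environ.get('isprime', '').split()
    else:
        argv = sys.argv[1:]
    parser = argparse.ArgumentParser()
    parser.add_argument('numbers', nargs='+', type=int)
    return parser.parse_args(argv)

if __name__ == '__main__':
    for n in get_args().numbers:
        print('{}: {}'.format(n, is_prime(n)))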

When programme #2 is modified to run on the cluster, programme #1 can submit jobs using subprocess.call(['qsub', '-v', 'options=...', 'programme2.py'], shell=False) (shell=False is the default when passing an argument list).
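For example (the file list and the options variable name are placeholders, not anything defined earlier):

import subprocess

# Sketch: one submitted job per candidate file; 'options' is whatever
# environment variable programme2.py expects.
candidate_files = ['data/a.txt', 'data/b.txt']  # from programme #1
for path in candidate_files:
    subprocess.call(['qsub', '-v', 'options=' + path, 'programme2.py'])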

Another approach would be to queue all the files in a database (say, an SQLite database). Programme #1 would then check all unprocessed entries in the database and determine the outcome for each (run, not run, run with special options). This gives you the opportunity to run programme #2 in parallel on the cluster, where each instance simply checks the database for files to analyse.
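A minimal sketch of such a queue, assuming a single table with a status column (the schema is my own invention, not part of the original answer):

import sqlite3

# Sketch: programme #1 fills a jobs table; cluster workers update it.
con = sqlite3.connect('queue.db')
con.execute("""CREATE TABLE IF NOT EXISTS jobs (
                   path    TEXT PRIMARY KEY,
                   options TEXT,
                   status  TEXT DEFAULT 'pending')""")
with con:  # one transaction for the whole batch
    con.executemany(
        'INSERT OR IGNORE INTO jobs (path, options) VALUES (?, ?)',
        [('data/a.txt', '--opt1 42'), ('data/b.txt', '')])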

Edit: When Programme #2 is an executable

Instead of a Python script, we use a bash script that reads the environment variables and puts them on the command line for the programme:

#!/bin/bash
# run from the directory the job was submitted from (falls back to .)
cd "${PBS_O_WORKDIR:-.}"
# put options into context/flags etc.
if [ -n "$option1" ]; then _opt1="--opt1 $option1"; fi
# we can even define our own defaults
_opt2='--no-verbose'
if [ -n "$option2" ]; then _opt2="-o $option2"; fi
# _opt1 and _opt2 are deliberately unquoted so each expands to flag + value
/path/to/exe $_opt1 $_opt2

If you are going for the database solution, then have a Python script that checks the database for unprocessed files, marks a file as being processed (do these two steps in a single transaction), gets the options, calls the executable with subprocess, marks the file as done when finished, checks for a new file, and so on.
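A sketch of such a worker loop, assuming the jobs table from the sketch above (BEGIN IMMEDIATE takes SQLite's write lock, so two workers cannot claim the same file):

import sqlite3
import subprocess

con = sqlite3.connect('queue.db', isolation_level=None)  # autocommit

def claim_next():
    # Select a file and mark it as running in a single transaction.
    con.execute('BEGIN IMMEDIATE')
    row = con.execute("SELECT path, options FROM jobs "
                      "WHERE status='pending' LIMIT 1").fetchone()
    if row is None:
        con.execute('ROLLBACK')
        return None
    con.execute("UPDATE jobs SET status='running' WHERE path=?", (row[0],))
    con.execute('COMMIT')
    return row

while True:
    job = claim_next()
    if job is None:
        break  # nothing left to process
    path, options = job
    subprocess.call(['/path/to/exe'] + options.split() + [path])
    con.execute("UPDATE jobs SET status='done' WHERE path=?", (path,))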

OTHER TIPS

You have obviously built yourself a string cmd containing a command that you could enter in a shell to run the second program. You are currently using subprocess.call(cmd, shell=True) to execute the second program from a Python script (it is then executed in a process on the same machine as the calling script).

I understand that you are asking how to submit a job to a cluster so that this second program is run on the cluster instead of the calling machine. Well, this is pretty easy and the method is independent of Python, so there is no 'Pythonic' solution, just an obvious one :-): replace your current cmd with a command that defers the heavy work to the cluster.

First of all, dig into the documentation of your cluster's qsub command (the underlying batch system might be SGE, LSF, or something else; you need the corresponding docs) and find the shell command line that properly submits an example job of yours to the cluster. It might look as simple as qsub ...args... cmd, where cmd is the content of the original cmd string. I assume that you now have the entire qsub command needed; let's call it qsubcmd (you have to come up with that on your own, we can't help there). Now all you need to do in your original Python script is to call

subprocess.call(qsubcmd, shell=True)

instead of

subprocess.call(cmd, shell=True)

Note that qsub likely works on only a few machines, typically known as your cluster's 'head node(s)'. This means that the Python script that submits these jobs should run on one of those machines (if that is not possible, you would need to add an ssh login step to the submission process, which we won't discuss here).

Please also note that, if you have the time, you should look into the implications of shell=True in your subprocess usage. If you can avoid shell=True, that is the more secure solution. This might, however, not be an issue in your environment.
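For instance, once you know the exact qsub arguments, you can pass them as a list and drop shell=True entirely (the flags shown are just an example; check your batch system's docs):

import subprocess

# List form: no shell parses the command, so file names with spaces
# or shell metacharacters cannot break (or hijack) the submission.
subprocess.call(['qsub', '-N', 'prog2job',
                 '-v', 'options=--opt1 42',
                 'programme2.py'])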

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow