Question

There are some commands I'd like to run on a grid using qsub (SGE 8.1.3, CentOS 5.9) that need to use a pipe (|) or a redirect (>). For example, let's say I have to parallelize the command

echo 'hello world' > hello.txt

(Obviously a simplified example: in reality I might need to redirect the output of a program like bowtie directly to samtools). If I did:

qsub echo 'hello world' > hello.txt

the resulting content of hello.txt would look like

Your job 123454321 ("echo") has been submitted

Similarly if I used a pipe (echo "hello world" | myprogram), that message is all that would be passed to myprogram, not the actual stdout.

I'm aware I could write small bash scripts that each contain the command with the pipe/redirect, and then do qsub ./myscript.sh. However, I'm trying to run many parallelized jobs at the same time using a script, so I'd have to write many such bash scripts, each with a slightly different command. When scripted, this solution can start to feel very hackish. An example of such a script in Python:

import os

for i, (infile1, infile2, outfile) in enumerate(files):
    command = ("bowtie -S %s %s | " +
               "samtools view -bS - > %s\n") % (infile1, infile2, outfile)

    # One throwaway script per job; it must still exist when the job starts
    script = "job" + str(i) + ".sh"
    with open(script, "w") as f:
        f.write(command)
    os.system("chmod 755 %s" % script)
    os.system("qsub -cwd ./%s" % script)

This is frustrating for a few reasons, among them that my program can't even delete the many jobXX.sh scripts afterwards to clean up after itself, since I don't know how long the job will be waiting in the queue, and the script has to be there when the job starts.

Is there a way to provide my full echo 'hello world' > hello.txt command to qsub without having to create another file containing the command?


Solution

You can do this by turning it into a bash -c command, which lets you put the | in a quoted statement:

 qsub bash -c "cmd <options> | cmd2 <options>"
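Applied to the Python loop from the question, the same idea removes the need for the jobXX.sh files. This is a minimal sketch, not a tested recipe: build_qsub_argv and submit are hypothetical helper names, qsub is assumed to be on PATH, and the "-b y" flag (mentioned in another answer on this page) is an assumption that may or may not be needed on your installation.

```python
import shlex
import subprocess

def build_qsub_argv(infile1, infile2, outfile):
    """Return the argument list for one qsub call (nothing is executed here)."""
    # Quote the file names so spaces or shell metacharacters survive
    # inside the command string handed to bash -c.
    pipeline = "bowtie -S %s %s | samtools view -bS - > %s" % (
        shlex.quote(infile1), shlex.quote(infile2), shlex.quote(outfile))
    # The pipeline is a single argument, so the submitting shell never
    # sees the | or >; they are interpreted by bash when the job runs.
    # Assumption: -b y makes qsub treat "bash" as a command, not a script file.
    return ["qsub", "-cwd", "-b", "y", "bash", "-c", pipeline]

def submit(files):
    """Submit one job per (infile1, infile2, outfile) tuple."""
    for infile1, infile2, outfile in files:
        # Passing a list avoids a local shell, so no extra quoting layer is needed.
        subprocess.run(build_qsub_argv(infile1, infile2, outfile), check=True)
```

Calling submit(files) with the same files list as in the question would send one job per tuple. How the quoted string is passed on to the compute node can differ between installations, so it is worth trying a trivial command such as the echo example first.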

As @spuder has noted in the comments, it seems that in other versions of qsub (not SGE 8.1.3, which I'm using), one can solve the problem with:

echo "cmd <options> | cmd2 <options>" | qsub

as well.
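From Python, the stdin variant could look roughly like the sketch below. It assumes your qsub reads the job script from standard input when no script file is named, which, as noted above, may not hold for every version. stdin_submission and submit_via_stdin are hypothetical helper names.

```python
import subprocess

def stdin_submission(command):
    """Return (argv, stdin_text) for submitting `command` through qsub's stdin."""
    # The command line itself becomes the body of the job script.
    return ["qsub", "-cwd"], command + "\n"

def submit_via_stdin(command):
    """Submit the command and return qsub's confirmation message."""
    argv, script_text = stdin_submission(command)
    # Assumption: qsub treats the text on stdin as the job script.
    result = subprocess.run(argv, input=script_text, text=True,
                            capture_output=True, check=True)
    return result.stdout
```

For example, submit_via_stdin("echo 'hello world' > hello.txt") would return the "Your job ... has been submitted" message while the redirect itself runs inside the job.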

OTHER TIPS

Although my answer is a bit late, I am adding it for any incoming viewers. To use a pipe/redirect and submit that as a qsub job, you need to do a couple of things. But first, note that using qsub at the end of a pipe like you're doing will only result in one job being sent to the queue (i.e., your code will run serially rather than being parallelized).

  1. Run qsub with binary mode enabled, since by default qsub treats its argument as a script file rather than as a command to execute. For that, pass the "-b y" flag to qsub; this avoids errors such as "command required for a binary mode" or "script length does not match declared length".
  2. Echo each call to qsub and then pipe that to a shell.

Suppose you have a file params-query.txt which holds several bowtie commands piped into samtools, each of the following form:

bowtie -q query -1 param1 -2 param2 ... | samtools ...

To send each query as a separate job, first build one qsub command line per line of STDIN with xargs. Notice that the quotes around the braces are important if you are submitting a command made of piped parts. That way your entire query is treated as a single unit.

cat params-query.txt | xargs -i echo qsub -b y -o output_log  -e error_log -N job_name \"{}\" | sh 
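The same loop can be sketched in Python, which sidesteps the escaped quotes because each query is passed to qsub as one list element. The flags are the ones used in the xargs line above; qsub_commands and submit_all are hypothetical helper names, and skipping blank lines is an added assumption.

```python
import subprocess

def qsub_commands(lines):
    """Yield one qsub argument list per non-empty query line."""
    for line in lines:
        query = line.strip()
        if not query:
            continue  # assumption: blank lines in the file are ignored
        # The whole piped query is a single argument, mirroring \"{}\" above.
        yield ["qsub", "-b", "y", "-o", "output_log", "-e", "error_log",
               "-N", "job_name", query]

def submit_all(path="params-query.txt"):
    """Submit every query in the file as a separate job."""
    with open(path) as handle:
        for argv in qsub_commands(handle):
            subprocess.run(argv, check=True)
```

Because qsub_commands only builds the argument lists, you can print them first to check what would be submitted before calling submit_all().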

If that doesn't work as expected, you're probably better off having bowtie write an intermediate output file and then having samtools read that file. You won't need to change the qsub call through xargs, but the lines in params-query.txt should look like:

bowtie -q query -o intermediate_query_out -1 param1 -2 param2 && samtools read_from_intermediate_query_out

This page has interesting qsub tricks you might like

grep http *.job | awk -F: '{print $1}' | sort -u | xargs -I {} qsub {}
Licensed under: CC-BY-SA with attribution