Wait for all jobs of a user to finish before submitting subsequent jobs to a PBS cluster

Question

This answer fills in the details of the solution suggested by Jonathan in the comments.

There are several resource managers based on the original Portable Batch System: OpenPBS, TORQUE and PBS Professional. The systems have diverged significantly and use different command syntax for newer features such as job arrays.

Job arrays are a convenient way to submit multiple similar jobs based on the same job script. Quoting from the manual:

Sometimes users will want to submit large numbers of jobs based on the same job script. Rather than using a script to repeatedly call qsub, a feature known as job arrays now exists to allow the creation of multiple jobs with one qsub command.

To submit a job array, PBS provides the following syntax:

 qsub -t 0-10,13,15 script.sh

This submits jobs with array ids 0, 1, 2, ..., 10, 13 and 15.

Within the script, the variable PBS_ARRAYID carries the id of the job within the array and can be used to pick the necessary configuration.
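As a minimal sketch of how a job script might use the array id, the snippet below maps the index to an input file. The input_<id>.dat naming scheme and the input_for_index helper are assumptions made for illustration only. PBS Professional exposes the index as PBS_ARRAY_INDEX rather than PBS_ARRAYID, so the sketch checks both.

```shell
#!/bin/sh
# Hypothetical job script body: choose an input file from the array index.
# The input_<id>.dat naming scheme is an assumption for this illustration.

input_for_index() {
    # $1: array index of this sub-job
    printf 'input_%s.dat\n' "$1"
}

# TORQUE sets PBS_ARRAYID; PBS Professional sets PBS_ARRAY_INDEX.
# Fall back to 0 so the script can also be run by hand outside the queue.
index="${PBS_ARRAYID:-${PBS_ARRAY_INDEX:-0}}"
input_file="$(input_for_index "$index")"
echo "sub-job $index would process $input_file"
```

Keeping the mapping in a small function makes it easy to check by hand before submitting a large array.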

Job arrays have their own dependency options.

TORQUE

TORQUE is the resource manager that the OP is probably using. It provides additional dependency options for job arrays, as the following example shows:

$ qsub -t 1-1000 script.sh
1234[].pbsserver.domainname
$ qsub -t 1001-2000 -W depend=afterokarray:1234[] script.sh
1235[].pbsserver.domainname

This will result in the following qstat output:

1234[]         script.sh    user          0 R queue
1235[]         script.sh    user          0 H queue

Tested on TORQUE version 3.0.4.

The full afterokarray syntax is in the qsub(1) manual.
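When scripting this, the id printed by the first qsub can be captured and passed to the second. The sketch below assumes TORQUE's -t and afterokarray options as shown in the transcript above. The function names depend_on_array and submit_chained_arrays are hypothetical helpers, not part of TORQUE, and the server suffix is stripped only to mirror the 1234[] form used in the example.

```shell
# Build the -W argument for a follow-up array (hypothetical helper).
depend_on_array() {
    # $1: id printed by qsub, e.g. 1234[].pbsserver.domainname
    # Strip everything from the first dot, leaving e.g. 1234[]
    printf 'depend=afterokarray:%s\n' "${1%%.*}"
}

# Submit two arrays so that the second waits for the first (hypothetical helper).
submit_chained_arrays() {
    # $1: job script; the index ranges are the ones from the example above
    first=$(qsub -t 1-1000 "$1") || return 1
    qsub -t 1001-2000 -W "$(depend_on_array "$first")" "$1"
}
```

Calling submit_chained_arrays script.sh would print the id of the second array, which stays held until every sub-job of the first array has finished successfully.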

PBS Professional

In PBS Professional, dependencies work uniformly on ordinary jobs and array jobs. Here is an example:

$ qsub -J 1-1000 -ry script.sh
1234[].pbsserver.domainname
$ qsub -J 1001-2000 -ry -W depend=afterok:1234[] script.sh
1235[].pbsserver.domainname

This will result in the following qstat output:

1234[]         script.sh    user          0 B queue
1235[]         script.sh    user          0 H queue

Update on TORQUE versions

Array dependencies have been available in TORQUE since version 2.5.3. Job arrays from version 2.5 are not compatible with job arrays in versions 2.3 or 2.4. In particular, the [] syntax was introduced in TORQUE 2.5.

Update on using a delimiter job

For TORQUE versions prior to 2.5, a different solution may work. It is based on submitting dummy delimiter jobs between the batches of jobs that need to be separated, and it uses three dependency types: on, before and after.

Consider the following example:

 $ DELIM=$(qsub -Wdepend=on:1000 dummy.sh)
 $ qsub -Wdepend=beforeany:$DELIM script.sh
 1001.pbsserver.domainname
 ... another 998 jobs ...
 $ qsub -Wdepend=beforeany:$DELIM script.sh
 2000.pbsserver.domainname
 $ qsub -Wdepend=after:$DELIM script.sh
 2001.pbsserver.domainname
 ...

This will result in a queue state like this:

1000         dummy.sh    user          0 H queue
1001         script.sh   user          0 R queue   
...
2000         script.sh   user          0 R queue   
2001         script.sh   user          0 H queue
...

That is, job #2001 will run only after the previous 1000 jobs terminate. The rudimentary job array facilities available in TORQUE 2.4 can probably be used as well to submit the script jobs.

This solution will also work for TORQUE version 2.5 and higher.
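The delimiter approach can be wrapped in a loop rather than typed out. The following is a sketch only: submit_batch is a hypothetical helper, dummy.sh and script.sh are the placeholder names from the example above, and it assumes that qsub prints the new job id on standard output.

```shell
# Submit a batch of jobs plus a dummy delimiter job that waits for all of them.
submit_batch() {
    # $1: number of jobs in the batch, $2: job script
    count=$1
    script=$2
    # The delimiter is held until $count "before" dependencies are satisfied.
    delim=$(qsub -W depend=on:"$count" dummy.sh) || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        # Each batch job registers itself as a predecessor of the delimiter.
        qsub -W depend=beforeany:"$delim" "$script" >/dev/null || return 1
        i=$((i + 1))
    done
    # Print the delimiter id so later jobs can depend on it.
    printf '%s\n' "$delim"
}
```

A later batch could then be attached with something like DELIM=$(submit_batch 1000 script.sh) followed by qsub -W depend=after:$DELIM script.sh, as in the transcript above.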