Question

I have a script that submits a number of jobs to run in parallel on an SGE queue, and a gathering script that is executed once that list of jobs has finished. I am using -hold_jid wc_job_list to hold the execution of the gathering script while the parallel jobs are running.
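
For illustration, the submission pattern looks roughly like this (the job names, scripts, and arguments here are made up):

    # Submit the parallel workers under a common name prefix.
    qsub -N work_1 worker.sh input1
    qsub -N work_2 worker.sh input2

    # The gathering job is held until all "work_*" jobs have finished.
    qsub -hold_jid "work_*" -N gather gather.sh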

I just noticed that sometimes some of the parallel jobs fail and the gathering script still runs. The documentation states that:

If any of the referenced jobs exits with exit code 100, the submitted job will remain ineligible for execution.

How can I catch the exit status of the failed parallel jobs, so that if any of them fails for any reason, the gathering script is either not executed or prints an error message?


Solution

If you are using Bash, you can check the exit status of your program (available as $?) and, if it is not 0 (the exit status for successful termination), call exit 100 at the end of your job script.
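
A minimal wrapper job script along these lines would do it (my_program is a stand-in for your actual command):

    #!/bin/bash
    # Hypothetical job script; "my_program" represents the real parallel workload.
    my_program "$@"
    status=$?

    if [ "$status" -ne 0 ]; then
        # Exit code 100 keeps jobs held with -hold_jid ineligible for execution.
        exit 100
    fi

    exit 0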

The problem with this is that your job will remain in the queue in state Eqw and has to be deleted manually.
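
To find and clean up such jobs, something like the following works (the job ID is a placeholder):

    # List jobs stuck in the error state, then remove them by hand.
    qstat | grep Eqw
    qdel <job_id>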

UPDATE: For every job that ends up in state Eqw, your administrators get an email...

Licensed under: CC-BY-SA with attribution