What eventually worked for me was proposed solution (1). Below I describe how I implemented it in my Condor submit file and my worker shell script.
Here's the shell script. The important change is checking whether R is installed on the compute node via if [ -f /usr/bin/R ]. If R is found, we go down a path that ends in a return value of 0; if R is not found, we return 1 (that's what the exit 0 and exit 1 lines do).
mkdir output
if [ -f /usr/bin/R ]
then
    # Run the simulation script matching this node's architecture
    if uname -m | grep -q '64'
    then
        Rscript code/simulations-x86_64.r "$@"
    else
        Rscript code/simulations-i386.r "$@"
    fi
    tar -zcvf output/output-$1-$2.tgz output/*.csv
    exit 0
else
    # R is missing on this node: signal failure so Condor re-queues the job
    exit 1
fi
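The two decisions the worker makes (is R present? which architecture?) can be exercised locally before submitting anything. A minimal sketch; the has_R and pick_script helper names and the path argument are my own illustrations, not part of the original script:

```shell
#!/bin/sh
# Return 0 if an R interpreter exists at the given path (default /usr/bin/R),
# 1 otherwise -- the same convention the worker script uses for its exit code.
has_R() {
    if [ -f "${1:-/usr/bin/R}" ]; then
        return 0
    else
        return 1
    fi
}

# Choose the architecture-specific simulation script the same way the
# worker does: grep the machine hardware name for '64'.
pick_script() {
    if uname -m | grep -q '64'; then
        echo "code/simulations-x86_64.r"
    else
        echo "code/simulations-i386.r"
    fi
}
```

Running has_R against a path you know exists (or doesn't) confirms the exit codes Condor will see.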
Now the Condor submit file. The crucial change is the second-to-last line: on_exit_remove = (ExitBySignal == False) && (ExitCode == 0). It checks each job's return value from the compute node: if the return value is nonzero (i.e., R wasn't found on the node), the job is put back into the queue to be re-run; otherwise the job is considered finished and is removed from the queue.
universe = vanilla
log = logs/log_$(Cluster)_$(Process).log
error = logs/err_$(Cluster)_$(Process).err
output = logs/out_$(Cluster)_$(Process).out
executable = condor/worker.sh
arguments = $(Cluster) $(Process)
requirements = (Target.OpSys=="LINUX" && regexp("stat", Machine))
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = code, R-libs, condor, seeds.csv
transfer_output_files = output
notification = Never
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
queue 1800
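The removal rule is just a boolean over two job attributes. A small shell sketch of that logic; the should_remove helper is my own illustration of how HTCondor evaluates the expression, not HTCondor code:

```shell
#!/bin/sh
# Mimic on_exit_remove = (ExitBySignal == False) && (ExitCode == 0):
# a job leaves the queue only when it exited normally (not killed by a
# signal) AND its exit code is 0. Any other combination keeps the job
# queued so it gets rescheduled on another node.
should_remove() {
    exit_by_signal=$1   # "True" or "False"
    exit_code=$2        # integer exit code of the worker script
    [ "$exit_by_signal" = "False" ] && [ "$exit_code" -eq 0 ]
}
```

So a worker that hits exit 1 (no R on the node) evaluates to False here, and Condor keeps the job in the queue.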