Question

I am trying to launch a large number of job steps from a single batch script. The steps can be completely different programs, and each needs exactly one CPU. First I tried doing this with srun's --multi-prog argument. Unfortunately, when using all CPUs assigned to my job in this manner, performance degrades massively; the run time increases to almost its serialized value. By undersubscribing the allocation I could ameliorate this a little. I couldn't find anything online regarding this problem, so I assumed it was a configuration problem of the cluster I am using.
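For illustration, a --multi-prog setup along the lines of what I tried looks roughly like this (the configuration file name and program names are placeholders, not my actual programs):

# multi.conf -- one line per task rank: <rank(s)> <executable> [arguments]
# %t expands to the task rank at launch time
0       ./prog_a input_0.dat
1       ./prog_b input_1.dat
2-47    ./prog_c input_%t.dat

# launched inside the allocation:
srun --ntasks=48 --multi-prog ./multi.conf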

So I tried going a different route. I implemented the following script (launched via sbatch my_script.slurm):

#!/bin/bash
#SBATCH -o $HOME/slurm/slurm_out/%j.%N.out
#SBATCH --error=$HOME/slurm/slurm_out/%j.%N.err_out
#SBATCH --get-user-env
#SBATCH -J test
#SBATCH -D $HOME/slurm
#SBATCH --export=NONE
#SBATCH --ntasks=48

NR_PROCS=$(($SLURM_NTASKS))
for PROC in $(seq 0 $(($NR_PROCS-1)));
do
    #My call looks like this:
    #srun --exclusive -n1 bash $PROJECT/call_shells/call_"$PROC".sh &
    srun --exclusive -n1 hostname &
    pids[${PROC}]=$!    #Save PID of this background process
done
for pid in ${pids[*]};
do
    wait ${pid}    # Wait for this job step to finish; wait returns the step's exit status
done

I am aware that the --exclusive argument is not really needed in my case. The shell scripts that are called contain the different binaries and their arguments. The remaining part of my script relies on all processes having finished, hence the wait loop. I replaced the actual calling line with hostname to make this a minimal working example.
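For illustration, one of the per-step scripts might look something like this (the binary name and its arguments are made-up placeholders):

#!/bin/bash
# call_0.sh -- hypothetical per-step script: one single-CPU binary with its arguments
./my_solver --input input_0.dat --threads 1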

At first this seemed to be the solution. Unfortunately, when I increase the number of nodes in my job allocation (for example by setting --ntasks to a number larger than the number of CPUs per node in my cluster), the script no longer works as expected and returns

srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1

and continues using only one node (i.e. 48 CPUs in my case, which work through the job steps as fast as before); all processes on the other node(s) are subsequently killed.

This seems to be the expected behaviour, but I can't really understand it. Why does every job step in a given allocation need to include at least as many tasks as there are nodes in the allocation? I ordinarily do not care at all about the number of nodes used in my allocation.

How can I implement my batch script so that it can be used reliably on multiple nodes?

Solution

Found it! The nomenclature and the many command-line options to Slurm confused me. The solution is given by:

#!/bin/bash
#SBATCH -o $HOME/slurm/slurm_out/%j.%N.out
#SBATCH --error=$HOME/slurm/slurm_out/%j.%N.err_out
#SBATCH --get-user-env
#SBATCH -J test
#SBATCH -D $HOME/slurm
#SBATCH --export=NONE
#SBATCH --ntasks=48

NR_PROCS=$(($SLURM_NTASKS))
for PROC in $(seq 0 $(($NR_PROCS-1)));
do
    #My call looks like this:
    #srun --exclusive -N1 -n1 bash $PROJECT/call_shells/call_"$PROC".sh &
    srun --exclusive -N1 -n1 hostname &    # -N1 -n1: run this step as one task on one node
    pids[${PROC}]=$!    #Save PID of this background process
done
for pid in ${pids[*]};
do
    wait ${pid}    # Wait for this job step to finish; wait returns the step's exit status
done

This tells srun to run each job step on exactly one node with a single task only.
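If the rest of the batch script needs to know whether any step failed, the wait loop can also collect the exit statuses; a minimal sketch (the FAILED counter is my own addition, not part of the original script):

FAILED=0
for pid in "${pids[@]}"; do
    if ! wait "${pid}"; then    # wait returns the exit status of that job step
        FAILED=$((FAILED+1))
    fi
done
echo "${FAILED} job step(s) returned a non-zero exit status"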

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow