GNU parallel --jobs option using multiple nodes on cluster with multiple cpus per node

StackOverflow https://stackoverflow.com/questions/22236337

Question

I am using GNU parallel to launch code on a high-performance computing (HPC) cluster that has 2 CPUs per node. The cluster uses the TORQUE Portable Batch System (PBS). My question is to clarify how the --jobs option for GNU parallel works in this scenario.

When I run a PBS script calling GNU parallel without the --jobs option, like this:

#PBS -lnodes=2:ppn=2
...
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
  matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40

it looks like it only uses one CPU per node, and it also produces the following error stream:

bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles087 (). Using 1.
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles108 (). Using 1.

This looks like one error for each node. I don't understand the first part (bash: parallel: command not found), but the second part tells me it is only using one CPU per node.

When I add the option -j2 to the parallel call, the errors go away, and I think it is using two CPUs per node. I am still a newbie to HPC, so my way of checking this is to output date-time stamps from my code (the dummy MATLAB code takes tens of seconds to complete). My questions are:

  1. Am I using the --jobs option correctly? Is it correct to specify -j2 because I have 2 CPUs per node? Or should I be using -jN where N is the total number of CPUs (number of nodes multiplied by number of CPUs per node)?
  2. It appears that GNU parallel attempts to determine the number of CPUs per node on its own. Is there a way that I can make this work properly?
  3. Is there any meaning to the bash: parallel: command not found message?

Solution

  1. Yes: -j is the number of jobs per node.
  2. Yes: Install 'parallel' in your $PATH on the remote hosts.
  3. Yes: It is a consequence of parallel being missing from the $PATH on the remote hosts.

GNU Parallel logs into each remote machine and tries to determine the number of cores there (by running parallel --number-of-cores). Because parallel is not in the remote $PATH, this fails, and GNU Parallel defaults to 1 CPU core per host. By giving -j2 you state the number of jobs per host explicitly, so GNU Parallel will not try to determine the number of cores.
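
If you want to check what GNU Parallel would detect on a node, you can run the same probe by hand once parallel is installed there; this is just a sanity check, using galles087 (one of the node names from the warnings above):

ssh galles087 parallel --number-of-cores

Once parallel is on the remote $PATH this should print the node's core count (2 here) and the warnings go away.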

Did you know that you can also give the number of cores in the --sshlogin as 4/myserver? This is useful if you have a mix of machines with different numbers of cores.
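
For example, instead of relying on remote detection you could write the ssh login file by hand with the core count prefixed to each host. This is only a sketch: nodes.sshlogin is an arbitrary file name, and the host names come from the warnings above:

cat > nodes.sshlogin <<EOF
2/galles087
2/galles108
EOF

parallel --sshloginfile nodes.sshlogin ... ::: 10 20 30 40

With the 2/ prefix GNU Parallel runs up to two jobs on each host without needing -j or remote core detection.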

OTHER TIPS

This is not an answer to the 3 primary questions, but I'd like to point out some other problems with the parallel statement in the first code block.

parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
  matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40

The shell expands $PBS_O_WORKDIR before parallel is executed. This means two things happen: (1) --env sees the variable's value (a directory path) rather than the variable's name, so it effectively does nothing, and (2) the value is expanded as part of the command string, which removes the need to pass $PBS_O_WORKDIR to the remote shell at all and is why there was no error.
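
To see the difference concretely: after shell expansion, --env receives the directory itself rather than the variable name. A sketch, where /home/user/jobdir stands in for whatever $PBS_O_WORKDIR happens to hold:

# What parallel actually receives after the shell expands $PBS_O_WORKDIR:
parallel --env /home/user/jobdir --sshloginfile $PBS_NODEFILE ...

# Passing the variable name instead (no $), so --env can act on the variable:
parallel --env PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE ...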

The latest version of parallel, 20151022, has a --workdir option (although the tutorial lists it as alpha testing), which is probably the easiest solution. The parallel command line would look something like:

parallel --workdir $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
  matlab -nodisplay -r "primes1({})" ::: 10 20 30 40

One final note: PBS_NODEFILE may list a host multiple times if more than one processor per node is requested from qsub. This may have implications for the number of jobs run, etc.
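
For example, with the #PBS -lnodes=2:ppn=2 request above, the node file would typically look like this:

galles087
galles087
galles108
galles108

If you would rather give parallel one entry per host, one option (a sketch; uniq_nodes is an arbitrary temporary file name) is to collapse the duplicates first:

sort -u $PBS_NODEFILE > uniq_nodes
parallel -j2 --sshloginfile uniq_nodes ... ::: 10 20 30 40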

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow