Question

I'm trying to run a piece of code on a large computer cluster in order to analyze different parts of the data.

I created two loops to assign the jobs to the different nodes and to the CPUs on each node. The analysis function I wrote, 'chnJob()', just needs to take an index to know which part of the data to analyze (here it is the shell variable 'chn').

The loop looks like this:

for NODE in $NODES; do # Loop through nodes
   for job_idx in {1..$PROCS_PER_NODE}; do # Loop through jobs per node (8 per node)
      echo "this is the channel $chn"
      ssh $NODE "matlab -nodisplay -nodesktop -nojvm -nosplash -r 'cd $WORK_DIR; chnJob($chn); quit'" &
      let chn++
      sleep 2
  done
done

Even though I can see that the chn variable is being incremented properly, the value passed to the MATLAB function is always the last value of chn.

This is probably because MATLAB takes a long time to start on each node, and bash has finished the loops by then, so each MATLAB instance only receives the last value.

Is there a way to circumvent that? Can I 'bake' the value of that variable when I'm calling the function?

Or is the problem entirely different?

Solution

Bash can't handle variables in brace range expressions, because brace expansion is performed before variable expansion. The endpoints have to be literals: {1..10}. Because of the way you have it now, the inner loop is executed exactly once per iteration of the outer loop instead of eight times (or whatever the value of PROCS_PER_NODE is). As a result, chn only advances by the number of nodes, when it should advance by the number of nodes times PROCS_PER_NODE.
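
The difference is easy to check locally. The following is a minimal sketch that counts iterations for a literal range and for a range with a variable endpoint; the value 3 and the counter names are chosen only for illustration, and it assumes the script is run with bash rather than another shell.

```bash
#!/usr/bin/env bash
PROCS_PER_NODE=3   # illustrative value

# Literal endpoints: brace expansion produces 1 2 3.
literal_count=0
for job_idx in {1..3}; do
  literal_count=$((literal_count + 1))
done

# Variable endpoint: the braces are not expanded as a range,
# so the loop body runs a single time with the unexpanded text.
variable_count=0
for job_idx in {1..$PROCS_PER_NODE}; do
  variable_last=$job_idx
  variable_count=$((variable_count + 1))
done

echo "literal range:  $literal_count iterations"
echo "variable range: $variable_count iteration, job_idx='$variable_last'"
```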

Use a C-style for loop instead:

for ((job_idx=1; job_idx<=$PROCS_PER_NODE; job_idx++))

You could increment both job_idx and chn in the for (if that doesn't give you off-by-one problems):

for ((job_idx=1; job_idx<=$PROCS_PER_NODE; job_idx++, chn++))
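
Put together, the whole launcher might look like the sketch below. The node names, WORK_DIR, and the starting value of chn are placeholders, and launch is a hypothetical dry-run helper that only prints the command, so the numbering can be checked before anything is sent over ssh.

```bash
#!/usr/bin/env bash
# Placeholder values for illustration; on the cluster these come from your setup.
NODES="node01 node02"
PROCS_PER_NODE=8
WORK_DIR=/path/to/work
chn=1

# Hypothetical dry-run stand-in for ssh: prints the node and the command.
launch() { printf '%s: %s\n' "$1" "$2"; }

launched=()   # records node:channel pairs for checking
for NODE in $NODES; do
  # C-style loop: the upper bound can be a variable, and chn advances with job_idx.
  for ((job_idx = 1; job_idx <= PROCS_PER_NODE; job_idx++, chn++)); do
    cmd="matlab -nodisplay -nodesktop -nojvm -nosplash -r 'cd $WORK_DIR; chnJob($chn); quit'"
    launch "$NODE" "$cmd"
    launched+=("$NODE:$chn")
  done
done
```

Once the printed commands look right, the launch call can be swapped for the original ssh $NODE "..." & line, with a wait after the outer loop if the script should block until all jobs finish.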

OTHER TIPS

I don't think that's what's happening. Can you try running this:

cnt=0
for a in 1 2; do 
  for b in 1 2; do 
    echo --- $cnt
    ssh somehost "echo result: '$cnt'" & 
    let cnt++
  done
done

Replace somehost with a host where you have sshd running. This prints the numbers 0 through 3, each coming back from echo result: '$cnt' executed remotely. So passing the value through ssh works correctly.
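
If no remote host is handy, the same behaviour can be seen without ssh. This is a rough local sketch that uses bash -c as a stand-in for the remote shell (an assumption made only so it runs anywhere): because the command string is in double quotes, $cnt is substituted by the local shell at the moment the background job is started, so a slow start does not change the value.

```bash
#!/usr/bin/env bash
cnt=0
out=$(
  for a in 1 2; do
    for b in 1 2; do
      # Double quotes: $cnt is expanded here, before the job is backgrounded.
      # The sleep imitates a program that is slow to start.
      bash -c "sleep 1; echo result: '$cnt'" &
      cnt=$((cnt + 1))
    done
  done
  wait   # let all background jobs finish before collecting output
)

# Background jobs may finish in any order, so sort before printing.
sorted=$(printf '%s\n' "$out" | sort)
printf '%s\n' "$sorted"
```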

One thing that I can suggest is for you to move your command (matlab ...) into some script in a known folder, then run that script in the above loops by giving a full path to that script. Something like:

ssh $NODE "/path/to/script.sh $cnt"

In the script, $1 will give you the value you want (i.e. $cnt from the loop). You can use echo $1 >> /tmp/values at the beginning of your script to collect all the values in file /tmp/values. Of course, rm /tmp/values before you start. This will confirm whether you are getting all the values as you want them.
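
A possible shape for such a script is sketched below, written as a function so it can be tried without MATLAB installed. The names run_channel, WORK_DIR, LOG_FILE and MATLAB_BIN are assumptions introduced for this example and are not part of the original setup.

```bash
#!/usr/bin/env bash
# Sketch of /path/to/script.sh. All names below are illustrative assumptions.
run_channel() {
  local chn=$1
  local work_dir=${WORK_DIR:-$PWD}          # where chnJob lives
  local log_file=${LOG_FILE:-/tmp/values}   # collects the received values
  local matlab_bin=${MATLAB_BIN:-matlab}    # overridable for a dry run

  # Record the argument first, so the log shows what each instance received.
  echo "$chn" >> "$log_file"

  "$matlab_bin" -nodisplay -nodesktop -nojvm -nosplash \
    -r "cd $work_dir; chnJob($chn); quit"
}

# In the real script, the last line would be:
#   run_channel "$1"
```

Setting MATLAB_BIN=echo before calling run_channel prints the MATLAB arguments instead of starting MATLAB, which is a cheap way to confirm the channel number arrives intact.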

If $PBS_NODEFILE contains the filename with the list of nodes (one per line) then this should work:

  seq 1 100 | parallel --slf $PBS_NODEFILE "matlab -nodisplay -nodesktop -nojvm -nosplash -r 'cd $WORK_DIR; chnJob({}); quit'"

Learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Licensed under: CC-BY-SA with attribution