Question

I am trying to build a script that runs MPI jobs in batch mode at certain hours. If I run mpdallexit, mpdboot and mpirun in a console, everything works fine and the parallel jobs start on all nodes listed in mpd.hosts. But if I run the same commands from a bash script (sent with at script now +1 minute), the mpd crashes and no jobs are started.

These are the relevant lines in the script:

$path_mpi/mpdallexit 
$path_mpi/mpdboot -n 5 &
time $path_mpi/mpirun -n 21 ./rams60 -f RAMSIN.operatiu 
$path_mpi/mpdallexit

and this is the error message from the log:

mpiexec_ventus: cannot connect to local mpd (/tmp/mpd2.console_meteo); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.

I have tried different mpdboot options:

--loccons says you do not want a console available on local mpd(s)
--remcons says you do not want consoles available on remote mpd(s)

or

mpdboot -n 5 &

but without success

MPICH2 is installed at /usr/local/mpich2-1.0.5p4/

EDIT 1:

After trying @shellter's advice on sleep, I still couldn't run the parallel jobs with either at or cron. When a batch mpirun job is issued, some processes start on the master node but none on the other cluster nodes.

On the master node:

ps -ef | grep rams
meteo    28043 26837  0 Apr21 ?        00:00:00 time /usr/bin/mpirun -n 50 -f machinefile ./rams60 -f RAMSIN.operatiu
meteo    28044 28043  0 Apr21 ?        00:00:00 /usr/bin/mpirun -n 50 -f machinefile ./rams60 -f RAMSIN.operatiu
meteo    28050 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu
meteo    28051 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu
meteo    28052 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu
meteo    28053 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu
meteo    28054 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu
meteo    28055 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu
meteo    28056 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu
meteo    28057 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu
meteo    28058 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu
meteo    28059 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu
meteo    28060 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu
meteo    28061 28045  0 Apr21 ?        00:00:00 ./rams60 -f RAMSIN.operatiu

Besides, rams60 creates no output files, even though the first thing it does is write the start analysis files.

Everything runs fine if I execute the script from the command line, but it seems that MPICH cannot communicate with the nodes when run in batch.

At first I installed mpich2 on the master node and exported it over NFS to the other nodes. Now I have installed mpich2 on every node.

Thanks in advance

Solution 3

I finally resolved the issue with the cron job thanks to Gilles Goullardet on the "mpich-discuss" mailing list.

The problem came from the environment in which batch jobs run. Cron uses a minimal environment, so some libraries needed by my job were not found on the cluster nodes. I had to add a line to my script exporting the library paths:

export LD_LIBRARY_PATH=/usr/local/mpich2-1.0.5p4/lib:/usr/local/hdf5/lib:$LD_LIBRARY_PATH
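A small caveat with the export above: if LD_LIBRARY_PATH is unset when the job starts (which may well be the case under cron), the trailing `:$LD_LIBRARY_PATH` leaves an empty entry at the end of the list, and the dynamic loader treats an empty entry as the current directory. The following is only a sketch of one way to avoid that; `prepend_ld_path` is a hypothetical helper name, not part of MPICH, and the two directories are the ones from the line above.

```shell
#!/bin/sh
# prepend_ld_path DIR (hypothetical helper): put DIR at the front of
# LD_LIBRARY_PATH, skipping it if already listed and never leaving an
# empty entry when the variable starts out unset or empty.
prepend_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*)
            ;;  # DIR is already in the list, nothing to do
        *)
            LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# Same directories as in the export line above; the last call ends up first.
prepend_ld_path /usr/local/hdf5/lib
prepend_ld_path /usr/local/mpich2-1.0.5p4/lib

# Printing the result into the job log makes later debugging easier.
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

Placed near the top of the batch script, before mpdboot and mpirun, this should have the same effect as the single export line.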

Now everything is working fine and my script runs twice a day as desired. Thank you all for your help; in the process I've learned some things about cron.
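For anyone hitting a similar "works in the console, fails in batch" problem, one way to find which variables the batch environment lacks is to dump `env` in both contexts and compare the variable names. This is a rough sketch, not what was actually done here; `missing_vars` and the dump file names are made up for illustration, and it assumes no variable value contains a newline.

```shell
#!/bin/sh
# missing_vars INTERACTIVE_DUMP BATCH_DUMP (hypothetical helper):
# print the names of variables present in the first `env` dump
# but absent from the second.
missing_vars() {
    _mv_a=$(mktemp) || return 1
    _mv_b=$(mktemp) || { rm -f "$_mv_a"; return 1; }
    # Keep only NAME from each NAME=value line, sorted for comm.
    cut -d= -f1 "$1" | sort -u > "$_mv_a"
    cut -d= -f1 "$2" | sort -u > "$_mv_b"
    # -23 keeps the lines that appear only in the first file.
    comm -23 "$_mv_a" "$_mv_b"
    rm -f "$_mv_a" "$_mv_b"
}

# Example use (file names are placeholders):
#   in a terminal:             env > /tmp/env.interactive
#   inside the cron or at job: env > /tmp/env.batch
#   afterwards:                missing_vars /tmp/env.interactive /tmp/env.batch
```

A variable such as LD_LIBRARY_PATH showing up in that list would point to the kind of problem described above.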

OTHER TIPS

The error message is very clear:

mpiexec_ventus: cannot connect to local mpd (/tmp/mpd2.console_meteo); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)

I assume you have mpd running so... check its -n option.
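Note that the script in the question starts mpdboot in the background with `&` and calls mpirun immediately, so mpirun may run before any mpd is listening. A possible way to rule that out is to poll until the ring answers before launching the job. The sketch below is generic; `wait_until` is a hypothetical helper, and the commented usage assumes that `mpdtrace` (the MPD command that lists the hosts in the ring) exits with a non-zero status while no local mpd is reachable.

```shell
#!/bin/sh
# wait_until TRIES DELAY CMD... (hypothetical helper): run CMD until it
# succeeds, at most TRIES times, sleeping DELAY seconds after each failure.
# Returns 0 as soon as CMD succeeds, 1 if every attempt failed.
wait_until() {
    _wu_tries=$1
    _wu_delay=$2
    shift 2
    _wu_i=0
    while [ "$_wu_i" -lt "$_wu_tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$_wu_delay"
        _wu_i=$((_wu_i + 1))
    done
    return 1
}

# Intended use in the batch script (assumption: mpdtrace fails until
# the local mpd accepts console connections):
#   "$path_mpi/mpdboot" -n 5
#   wait_until 10 3 "$path_mpi/mpdtrace" || { echo "mpd ring not up" >&2; exit 1; }
#   "$path_mpi/mpirun" -n 21 ./rams60 -f RAMSIN.operatiu
```

Compared with a fixed sleep, this gives up with a clear message in the log instead of letting mpirun fail with the console error.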

You can use screen to give your script a tty, since cron, at, or a nested script don't allocate one by default. This way you can attach to it if necessary.

screen -D -m <command>

This will launch your command in a detached screen session that will exit when the command finishes.

Licensed under: CC-BY-SA with attribution