Open MPI/MPICH - What happens if a node terminates?

https://stackoverflow.com/questions/4194965

11-10-2019
Question

I would like to know what happens if a node of an Open MPI/MPICH2 cluster terminates. Is there some mechanism that tolerates this case and continues the execution?

Thanks for your answers Heinrich

Solution

Note that MPI has let you set an error handler since the MPI 1.x days; see, e.g.:

http://www.mpi-forum.org/docs/mpi-11-html/node148.html

As Mark notes, most of us just use MPI_ERRORS_ARE_FATAL (which is the default) because our algorithms are very state-heavy and can't easily be recovered (except through checkpointing, which most of us do anyway).

But that need not be the case; you can install MPI_ERRORS_RETURN as the error handler so that the MPI functions return error codes, and then try to recover as best you can.

There are a few fault-tolerant MPI packages out there, such as FT-MPI (http://icl.cs.utk.edu/ftmpi/), which is kind of old and only implements MPI 1.2 functionality. More recently, CIFTS (http://osl.iu.edu/research/ft/cifts/) is one approach being put into Open MPI as a separate project, and there is also an OS-level checkpoint/restart package, BLCR, which may be of interest.
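BLCR checkpoints whole processes at the OS level. The do-it-yourself, application-level checkpointing mentioned earlier can be far simpler. The following is only a sketch of the idea in C: the State struct, the file naming, and the functions save_checkpoint and load_checkpoint are all invented for illustration, and a real code would write per-rank files (or use parallel I/O) and store far more state.

```c
#include <stdio.h>

/* Hypothetical solver state; a real application would have far more. */
typedef struct {
    int    step;
    double values[4];
} State;

/* Write the state to a temporary file, then rename it over the previous
 * checkpoint. On POSIX systems the rename replaces the old file in one
 * step, so a crash during the write leaves the old checkpoint intact.
 * Returns 0 on success, -1 on failure. */
static int save_checkpoint(const char *path, const State *s)
{
    char tmp[512];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;

    size_t n = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || n != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read a previously saved state.
 * Returns 0 and fills *s if a complete checkpoint exists, -1 otherwise. */
static int load_checkpoint(const char *path, State *s)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    size_t n = fread(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}
```

On startup each process would try load_checkpoint first and fall back to a fresh initial state if it fails; during the run it would call save_checkpoint every so many steps. After a node dies you resubmit the job and lose only the work done since the last checkpoint.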

The MPI-3 forum is discussing a standard fault-tolerance API in MPI, so the pace of such projects is accelerating.

OTHER TIPS

Not really; MPI doesn't provide out-of-the-box fault tolerance. You could write your programs to deal with the failure of a process, but most of us don't; we live with our programs crashing when the hardware dies. This situation is changing with the emergence of supercomputers with hundreds of thousands of processors and a mean time between failures on the order of seconds.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow