OpenMPI fault tolerance

https://stackoverflow.com/questions/19615931

01-07-2022

Question

I have an assignment to implement simple fault tolerance in an Open MPI application. Although we set the MPI error handler to MPI_ERRORS_RETURN, when one of our nodes is unplugged from the cluster, the next MPI_ call hangs for a long time and then fails with the following error:

[btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() failed: Connection timed out (110)

My take from this is that, with Open MPI, processing cannot continue on the remaining nodes once one node drops off the network. Can anyone confirm this, or point me toward a way to prevent the btl_tcp_endpoint error?

We are using OpenMPI version 1.6.5.

Solution

The MPI_ERRORS_RETURN code paths are not well tested (and probably not well implemented) in Open MPI. They simply haven't been a priority, so we've never really done much work in this area.

Sorry.

Licensed under: CC-BY-SA with attribution
