Question

I asked the same question here, but I think it was too long so I'll try again in a shorter way:

I've got a C++ program using the latest OpenMPI on a Rocks cluster under a master/slave setup. The slaves perform a task and then report data to the master, which writes it to a database, using blocking MPI_SEND / MPI_RECV calls (through Boost MPI). The master is currently significantly slower than the slaves. I'm having trouble with the program because about half of the slaves get stuck on the first task and never report their data; from strace/ltrace it looks like they're stuck polling inside MPI_SEND and their message never gets received.

I wrote a program to test this theory (again, listed in full here) and I can cause a similar problem, where the slaves' communication slows down significantly so they complete fewer tasks than they should, by manipulating the speed of the slaves and master using sleep. When speed(master) > speed(slaves), everything works fine. When speed(master) < speed(slaves), messages get significantly delayed for some slaves every time.

Any ideas why this might be?


Solution

As far as I can see, this results from the recv inside the while loop on the master node.

    ...
    while (1) {
        // Receive results from a slave.
        stat = world.recv(MPI_ANY_SOURCE, MPI_ANY_TAG);
    ...

When there is a message from one slave, the master cannot receive any other messages until the code inside the while loop has finished (which takes a while, since there is a sleep), because the master node does not run this part in parallel. Therefore all the other slaves cannot complete their sends until the first slave has finished sending its message. Then the next slave can send its message, but again all the other slaves are blocked until the code inside the while loop has been executed.

This results in the behavior you see: the slaves' communication is very slow. To avoid this problem, you need to make the point-to-point communication non-blocking or use collective communications.
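
Below is a minimal sketch (not your actual program) of what a non-blocking master loop could look like with Boost.MPI: one irecv is kept outstanding per slave, so a slave's send can complete even while the master is busy with the slow per-result work. The int payload, the REPORTTAG value and process_result() are placeholders standing in for your real data and database write.

    #include <boost/mpi.hpp>
    #include <boost/mpi/nonblocking.hpp>
    #include <utility>
    #include <vector>

    namespace mpi = boost::mpi;

    const int REPORTTAG = 1;                     // tag value is an assumption

    void process_result(int source, int value)   // placeholder for the slow DB write
    {
        /* ... */
    }

    void master_loop(mpi::communicator& world, int numtasks)
    {
        const int nslaves = world.size() - 1;
        std::vector<int> buf(nslaves);
        std::vector<mpi::request> reqs;
        for (int s = 1; s <= nslaves; ++s)       // keep one receive posted per slave
            reqs.push_back(world.irecv(s, REPORTTAG, buf[s - 1]));

        for (int done = 0; done < numtasks; ++done) {
            // Returns as soon as any slave's message has arrived.
            std::pair<mpi::status, std::vector<mpi::request>::iterator> r =
                mpi::wait_any(reqs.begin(), reqs.end());
            int src = r.first.source();
            process_result(src, buf[src - 1]);   // the slow work happens here
            // Re-arm the receive for that slave so its next send is not delayed.
            *r.second = world.irecv(src, REPORTTAG, buf[src - 1]);
        }
        // Note: the receives still outstanding at the end would have to be
        // cancelled or matched by a final message from each slave.
    }

Because a receive is already posted when a slave calls send, MPI can complete the transfer in the background instead of forcing the slave to wait until the master reaches its next recv.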

UPDATE 1:

Let's assume that the master has distributed his data and now waits until the slaves report back. When the first slave reports back, it will first send its REPORTTAG and then its DONETAG. Now the master will send it back a new job if

 currentTask < numtasks

Now that slave starts its calculation again. It may well be that, by the time it is finished, the master has only managed to handle one other slave. So the slave from the beginning again sends first its REPORTTAG and then its DONETAG and gets a new job. If this continues, in the end only two slaves have received new jobs while the rest were never able to finish (report) theirs, so at some point this becomes true:

 currentTask >= numtasks

Now you stop all jobs, even though not all slaves have reported their data back or done more than one task.
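
To make that concrete, here is a minimal sketch of a master loop whose exit condition counts results received rather than tasks handed out, which is one way to avoid stopping too early. The REPORTTAG/DONETAG protocol is taken from your description; the tag values, counters, WORKTAG and the int payload are assumptions, and the receives here are still blocking (this sketch only addresses the early-termination issue).

    #include <boost/mpi.hpp>

    namespace mpi = boost::mpi;

    const int REPORTTAG = 1;   // tag values are assumptions
    const int DONETAG   = 2;
    const int WORKTAG   = 3;

    void master_loop(mpi::communicator& world, int numtasks)
    {
        int currentTask   = world.size() - 1;  // assumes one task was handed to each
                                               // slave during start-up
        int tasksReported = 0;                 // results actually received back

        // Exit only when every task has been reported back, not as soon as
        // currentTask reaches numtasks.
        while (tasksReported < numtasks) {
            int result;
            mpi::status s = world.recv(mpi::any_source, REPORTTAG, result);
            ++tasksReported;
            // ... write result to the database ...

            world.recv(s.source(), DONETAG);   // the slave's "ready for more" message
            if (currentTask < numtasks) {
                world.send(s.source(), WORKTAG, currentTask);
                ++currentTask;
            }
            // otherwise just stop handing out new work; don't kill the slave yet
        }
    }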

This problem occurs mostly when the network connections of the different nodes are very different. The reason is that a send or receive is not completed immediately after the call; instead, the communication only takes place once the two calls are able to make some kind of handshake.

As solutions I would suggest one of the following:

  • Make sure that all slaves have finished before killing all jobs.
  • Use gather and scatter instead of point-to-point messages; then all slaves are synchronized after each task (see the sketch after this list).
  • Use buffered or non-blocking send and receive operations, if the messages are not too big. Make sure that you do not run into a memory overflow on the master.
  • Change from master/slave to a more parallel working mode, e.g. divide all tasks between two nodes, then divide the tasks further from those nodes to the next two, and so on, and at the end send the results back the same way. This solution may also have the advantage that the communication cost is only O(log n) instead of O(n).
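
For the gather/scatter variant mentioned in the second point, a minimal, self-contained sketch could look like the following. The payloads are plain ints just to keep the example compilable; the real program would use its own task and result types, and the fixed number of rounds is an assumption.

    #include <boost/mpi.hpp>
    #include <cmath>
    #include <vector>

    namespace mpi = boost::mpi;

    int main(int argc, char* argv[])
    {
        mpi::environment env(argc, argv);
        mpi::communicator world;

        const int rounds = 10;                   // assumption: fixed number of rounds
        for (int round = 0; round < rounds; ++round) {
            // The master builds one task per rank and scatters them.
            int my_task = 0;
            if (world.rank() == 0) {
                std::vector<int> tasks(world.size());
                for (int i = 0; i < world.size(); ++i)
                    tasks[i] = round * world.size() + i;
                mpi::scatter(world, tasks, my_task, 0);
            } else {
                mpi::scatter(world, my_task, 0);
            }

            // Every rank (the master included) computes its result.
            double my_result = std::sqrt(static_cast<double>(my_task));

            // gather() synchronizes everyone after each round; the slow
            // database write then happens once per round on the master.
            if (world.rank() == 0) {
                std::vector<double> results;
                mpi::gather(world, my_result, results, 0);
                // ... write `results` to the database here ...
            } else {
                mpi::gather(world, my_result, 0);
            }
        }
        return 0;
    }

The trade-off is that each round runs at the speed of the slowest slave, since the gather acts as a synchronization point.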

Hope this helped.

Licensed under: CC-BY-SA with attribution