Question

I use MPI non-blocking communication (MPI_Irecv, MPI_Isend) to monitor the slaves' idle states; the code is shown below.

rank 0:

int dest = -1;
while (dest <= 0) {
   int i;
   for (i = 1; i <= slaves_num; i++) {
      printf("slave %d, now is %d \n", i, idle_node[i]);
      if (idle_node[i] == 1) {
         idle_node[i] = 0;
         dest = i;
         break;
      }
   }
   if (dest <= 0) {
      MPI_Irecv(&idle_node[1], 1, MPI_INT, 1, MSG_IDLE, MPI_COMM_WORLD, &request);
      MPI_Irecv(&idle_node[2], 1, MPI_INT, 2, MSG_IDLE, MPI_COMM_WORLD, &request);
      MPI_Irecv(&idle_node[3], 1, MPI_INT, 3, MSG_IDLE, MPI_COMM_WORLD, &request);
      // MPI_Wait(&request, &status);
   }
   usleep(100000);
}

idle_node[dest] = 0; // indicates this slave is busy now

rank 1,2,3:

while(1)
{
   ... // do something
   MPI_Isend(&idle, 1, MPI_INT, 0, MSG_IDLE, MPI_COMM_WORLD, &request);
   MPI_Wait(&request, &status);
}

It works, but I want it to be faster, so I deleted the line:

usleep(100000);

Then rank 0 goes into an endless loop like this:

slave 1, now is 0
slave 2, now is 0
slave 3, now is 0 
slave 1, now is 0
slave 2, now is 0
slave 3, now is 0 
...

So does this indicate that when I call MPI_Irecv, it merely tells MPI that I want to receive a message here (the message has not actually been received yet), and that MPI needs additional time to receive the real data? Or is there some other reason?


Solution

The use of non-blocking operations has been discussed over and over again here. From the MPI specification (section Nonblocking Communication):

Similarly, a nonblocking receive start call initiates the receive operation, but does not complete it. The call can return before a message is stored into the receive buffer. A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed.

(the text is quoted verbatim from the standard; the emphasis on the last sentence is mine)

The key sentence is the last one. The standard does not give any guarantee that a non-blocking receive operation will ever complete (or even start) unless MPI_WAIT[ALL|SOME|ANY] or MPI_TEST[ALL|SOME|ANY] was called (with MPI_TEST* setting a value of true for the completion flag).

By default, Open MPI is built as a single-threaded library, and without special hardware acceleration the only way to progress non-blocking operations is to periodically call into some non-blocking call (the primary example being MPI_TEST*) or into a blocking one (the primary example being MPI_WAIT*).
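
For illustration only, here is a minimal sketch (not taken from the question) of how the master loop could drive progress itself by polling with MPI_Testany instead of re-posting receives and sleeping; reqs, index and flag are illustrative additions, while slaves_num, idle_node and MSG_IDLE come from the question:

MPI_Request reqs[slaves_num];
int i, index, flag;

// Post one receive per worker, each with its own request handle
for (i = 1; i <= slaves_num; i++)
   MPI_Irecv(&idle_node[i], 1, MPI_INT, i, MSG_IDLE,
             MPI_COMM_WORLD, &reqs[i-1]);

// Each call to MPI_Testany progresses the pending receives and
// completes at most one of them
do {
   MPI_Testany(slaves_num, reqs, &index, &flag, MPI_STATUS_IGNORE);
} while (!flag);

// reqs[index] has completed, i.e. worker index+1 has reported itself idle
idle_node[index + 1] = 0;

Note that this still busy-waits and burns a full CPU core; the blocking alternatives below avoid that.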

Also, your code leads to a nasty leak that will sooner or later result in resource exhaustion: you call MPI_Irecv repeatedly with the same request variable, overwriting its value and losing the reference to the previously started requests. Requests that are never waited upon are never freed and therefore remain in memory.

There is absolutely no need to use non-blocking operations in your case. If I understand the logic correctly, you can achieve what you want with code as simple as:

MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MSG_IDLE, MPI_COMM_WORLD, &status);
idle_node[status.MPI_SOURCE] = 0;
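
To tie this back to the original loop, a minimal sketch of rank 0's dispatch logic built around that blocking receive could look as follows; have_more_work() and send_work() are hypothetical helpers, and dummy and status are assumed to be declared (int dummy; MPI_Status status;):

while (have_more_work())   // hypothetical: is there anything left to hand out?
{
   MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MSG_IDLE,
            MPI_COMM_WORLD, &status);
   int dest = status.MPI_SOURCE;   // the worker that just reported itself idle
   idle_node[dest] = 0;            // mark it as busy
   send_work(dest);                // hypothetical: hand it the next piece of work
}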

If you'd like to process more than one worker at the same time, it is a bit more involved:

MPI_Request reqs[slaves_num];
int i, indices[slaves_num], num_completed;

for (i = 0; i < slaves_num; i++)
   reqs[i] = MPI_REQUEST_NULL;

while (1)
{
   // Repost all completed (or never started) receives
   for (i = 1; i <= slaves_num; i++)
      if (reqs[i-1] == MPI_REQUEST_NULL)
         MPI_Irecv(&idle_node[i], 1, MPI_INT, i, MSG_IDLE,
                   MPI_COMM_WORLD, &reqs[i-1]);

   MPI_Waitsome(slaves_num, reqs, &num_completed, indices, MPI_STATUSES_IGNORE);

   // Examine num_completed and indices and feed the workers with data
   ...
}

After the call to MPI_Waitsome there will be one or more completed requests. The exact number will be in num_completed and the indices of the completed requests will be filled in the first num_completed elements of indices[]. The completed requests will be freed and the corresponding elements of reqs[] will be set to MPI_REQUEST_NULL.
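
As an illustration of the examination step (the "..." in the loop above), one could simply walk over the first num_completed entries of indices[]; send_work() is again a hypothetical helper for handing a task to a worker:

for (i = 0; i < num_completed; i++)
{
   int worker = indices[i] + 1;   // indices are 0-based, worker ranks start at 1
   idle_node[worker] = 0;         // mark the worker as busy
   send_work(worker);             // hypothetical: dispatch the next task to it
}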

Also, there appears to be a common misconception about using non-blocking operations: a non-blocking send can be matched by a blocking receive, and a blocking send can equally be matched by a non-blocking receive. That makes constructs like the following nonsensical:

// Receiver
MPI_Irecv(..., &request);
... do something ...
MPI_Wait(&request, &status);

// Sender
MPI_Isend(..., &request);
MPI_Wait(&request, MPI_STATUS_IGNORE);

MPI_Isend immediately followed by MPI_Wait is equivalent to MPI_Send, and the following code is perfectly valid (and easier to understand):

// Receiver
MPI_Irecv(..., &request);
... do something ...
MPI_Wait(&request, &status);

// Sender
MPI_Send(...);
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow