Question

Let me explain. Consider four slave nodes, 1, 2, 3 and 4, and a master node, 0. Nodes 1, 2, 3 and 4 need to send data to 0, and 0 receives this data with the following loop:

for (int proc = 1; proc < procCount; proc++) // for each slave process (procCount = 5)
{
    for (int p = 0; p < 50; p++)
    {
        std::cout << proc << "\tA\t" << p << std::endl;

        // read in binary data
        int chunkP;
        int realP;
        real fitnessVal;
        real fitnessValB;
        real fitnessValC;
        int conCount;
        real subConCount;
        real networkEnergyLoss;
        real movementEnergyLoss;
        long spikeCount;

        MPI_Recv(reinterpret_cast<char *>(&chunkP), sizeof(chunkP),
                 MPI_CHAR, proc, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
        MPI_Recv(reinterpret_cast<char *>(&realP), sizeof(realP),
                 .
                 .
                 .
    }
}

Clearly, the order in which 1, 2, 3 and 4 send their data to 0 cannot be assumed, since they all operate independently of each other; 2 might send its data before 1. So, assuming 2 does send before 1 (for example), the receiving loop in 0 shown above still cannot make progress until a message from rank 1 arrives, because the outer for loop forces the source argument 'proc' of MPI_Recv to match rank 1 first.

So the loop 'waits' for data from 1 before it can do anything else, even if data is already arriving from 2, 3 and 4. What happens to the data arriving from 2, 3 and 4 if it arrives before the data from 1? Can it be 'forgotten', in the sense that once the data from 1 does start arriving and proc increments to 2, the data originally sent by 2 is simply not there any more? If it is 'forgotten', the whole distributed simulation will just hang, because it never ends up being able to process the data of a particular slave process correctly.

Thanks, Ben.


Solution

Firstly, do you really mean to receive MPI_CHAR into chunkP, which is an int? Shouldn't you receive an MPI_INT?
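
For comparison, a minimal sketch of what that receive could look like with a matching datatype (assuming chunkP is an int, and that `proc` and `stat` are as in your loop):

// Receive a single int from rank `proc`; the count and datatype now
// describe the buffer in elements rather than raw bytes.
MPI_Recv(&chunkP, 1, MPI_INT, proc, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);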

The messages from ranks 1:4 will not get lost - they will get queued until rank 0 chooses to receive them. This behaviour is mandated by the MPI standard.

If the messages are large enough, ranks 1:4 may block until they can actually send their messages to rank 0 (most MPI implementations have limited buffering).
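
To make that concrete, here is a rough sketch of the sender side (the variable names are assumptions, not taken from your code; only the MPI calls are standard). With a plain MPI_Send, once a message is larger than the implementation's internal buffering allows, the call may not return until rank 0 posts the matching receive:

// Slave ranks 1..4: send each field of each of the 50 items, in order.
for (int p = 0; p < 50; p++)
{
    int chunkP = 0; // placeholder values for illustration
    int realP  = 0;

    // For large messages, a plain MPI_Send may block here until
    // rank 0 posts the matching receive.
    MPI_Send(&chunkP, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    MPI_Send(&realP,  1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    // ... remaining fields, in the same order rank 0 receives them
}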

You might also consider having rank 0 do an MPI_ANY_SOURCE receive for the first receive, to see who is ready to send. You'll need to take care, though, to ensure that the subsequent receives are posted for the corresponding source; look in the MPI_Status struct to see which rank the message was actually sent from.
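
A minimal sketch of that pattern, assuming the same per-item fields as in the question (chunkP first, then the rest): the first receive accepts any source, and the remaining receives for that item are pinned to whichever rank actually sent it, via stat.MPI_SOURCE. Because MPI preserves message order between a given sender and receiver, the follow-up receives pick up the rest of that worker's item:

MPI_Status stat;
// One iteration per work item, regardless of which slave it comes from.
for (int item = 0; item < (procCount - 1) * 50; item++)
{
    int chunkP;
    // Take the first field from whichever slave is ready.
    MPI_Recv(&chunkP, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &stat);

    // Pin the remaining receives for this item to the same slave.
    int src = stat.MPI_SOURCE;
    int realP;
    MPI_Recv(&realP, 1, MPI_INT, src, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
    // ... remaining fields for this item, all received from `src`
}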

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow