MPI with C: Passive RMA synchronization

Question 1

First of all, the passive RMA mechanism does not somehow magically poke into the remote process' memory since not many MPI transports have real RDMA capabilities and even those that do (e.g. InfiniBand) require a great deal of not-that-passive involvement of the target in order to allow for passive RMA operations to happen. This is explained in the MPI standard but in the very abstract form of public and private copies of the memory exposed through an RMA window.

Achieving working and portable passive RMA with MPI-2 involves several steps.

Step 1: Window allocation in the target process

For portability and performance reasons the memory for the window should be allocated using MPI_ALLOC_MEM:

int size;
MPI_Comm_rank(MPI_COMM_WORLD, &size);

int *schedule;
MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

for (int i = 0; i < size; i++)
{
   schedule[i] = 0;
}

MPI_Win win;
MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
   MPI_COMM_WORLD, &win);

...

MPI_Win_free(win);
MPI_Free_mem(schedule);

Step 2: Memory synchronisation at the target

The MPI standard forbids concurrent access to the same location in the window (§11.3 from the MPI-2.2 specification):

It is erroneous to have concurrent conflicting accesses to the same memory location in a window; if a location is updated by a put or accumulate operation, then this location cannot be accessed by a load or another RMA operation until the updating operation has completed at the target.

Therefore each access to schedule[] in the target has to be protected by a lock (shared since it only reads the memory location):

while (!ready)
{
   MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
   ready = fnz(schedule, oldschedule, size);
   MPI_Win_unlock(0, win);
}

Another reason for locking the window at the target is to provide entries into the MPI library and thus facilitate progression of the local part of the RMA operation. MPI provides portable RMA even when using non-RDMA capable transports, e.g. TCP/IP or shared memory, and that requires a lot of active work (called progression) to be done at the target in order to support "passive" RMA. Some libraries provide asynchronous progression threads that can progress the operation in the background, e.g. Open MPI when configured with --enable-opal-multi-threads (disabled by default), but relying on such behaviour results in non-portable programs. That's why the MPI standard allows for the following relaxed semantics of the put operation (§11.7, p. 365):

6 . An update by a put or accumulate call to a public window copy becomes visible in the private copy in process memory at latest when an ensuing call to MPI_WIN_WAIT, MPI_WIN_FENCE, or MPI_WIN_LOCK is executed on that window by the window owner.

If a put or accumulate access was synchronized with a lock, then the update of the public window copy is complete as soon as the updating process executed MPI_WIN_UNLOCK. On the other hand, the update of private copy in the process memory may be delayed until the target process executes a synchronization call on that window (6). Thus, updates to process memory can always be delayed until the process executes a suitable synchronization call. Updates to a public window copy can also be delayed until the window owner executes a synchronization call, if fences or post-start-complete-wait synchronization is used. Only when lock synchronization is used does it becomes necessary to update the public window copy, even if the window owner does not execute any related synchronization call.

This is also illustrated in Example 11.12 in the same section of the standard (p. 367). And indeed, both Open MPI and Intel MPI do not update the value of schedule[] if the lock/unlock calls in the code of the master are commented out. The MPI standard further advises (§11.7, p. 366):

Advice to users. A user can write correct programs by following the following rules:

...

lock: Updates to the window are protected by exclusive locks if they may conflict. Nonconflicting accesses (such as read-only accesses or accumulate accesses) are protected by shared locks, both for local accesses and for RMA accesses.

Step 3: Providing the correct parameters to MPI_PUT at the origin

MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); would transfer everything into the first element of the target window. The correct invocation given that the window at the target was created with disp_unit == sizeof(int) is:

int one = 1;
MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);

The local value of one is thus transferred into rank * sizeof(int) bytes following the beginning of the window at the target. If disp_unit was set to 1, the correct put would be:

MPI_Put(&one, 1, MPI_INT, 0, rank * sizeof(int), 1, MPI_INT, win);

Step 4: Dealing with implementation specifics

The above detailed program works out-of-the box with Intel MPI. With Open MPI one has to take special care. The library is built around a set of frameworks and implementing modules. The osc (one-sided communication) framework comes in two implementations - rdma and pt2pt. The default (in Open MPI 1.6.x and probably earlier) is rdma and for some reason it does not progress the RMA operations at the target side when MPI_WIN_(UN)LOCK is called, which leads to deadlock-like behaviour unless another communication call is made (MPI_BARRIER in your case). On the other hand the pt2pt module progresses all operations as expected. Therefore with Open MPI one has to start the program like following in order to specifically select the pt2pt component:

$ mpiexec --mca osc pt2pt ...

A fully working C99 sample code follows:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

// Compares schedule and oldschedule and prints schedule if different
// Also displays the time in seconds since the first invocation
int fnz (int *schedule, int *oldschedule, int size)
{
    static double starttime = -1.0;
    int diff = 0;

    for (int i = 0; i < size; i++)
       diff |= (schedule[i] != oldschedule[i]);

    if (diff)
    {
       int res = 0;

       if (starttime < 0.0) starttime = MPI_Wtime();

       printf("[%6.3f] Schedule:", MPI_Wtime() - starttime);
       for (int i = 0; i < size; i++)
       {
          printf("\t%d", schedule[i]);
          res += schedule[i];
          oldschedule[i] = schedule[i];
       }
       printf("\n");

       return(res == size-1);
    }
    return 0;
}

int main (int argc, char **argv)
{
    MPI_Win win;
    int rank, size;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
       int *oldschedule = malloc(size * sizeof(int));
       // Use MPI to allocate memory for the target window
       int *schedule;
       MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

       for (int i = 0; i < size; i++)
       {
          schedule[i] = 0;
          oldschedule[i] = -1;
       }

       // Create a window. Set the displacement unit to sizeof(int) to simplify
       // the addressing at the originator processes
       MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
          MPI_COMM_WORLD, &win);

       int ready = 0;
       while (!ready)
       {
          // Without the lock/unlock schedule stays forever filled with 0s
          MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
          ready = fnz(schedule, oldschedule, size);
          MPI_Win_unlock(0, win);
       }
       printf("All workers checked in using RMA\n");

       // Release the window
       MPI_Win_free(&win);
       // Free the allocated memory
       MPI_Free_mem(schedule);
       free(oldschedule);

       printf("Master done\n");
    }
    else
    {
       int one = 1;

       // Worker processes do not expose memory in the window
       MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

       // Simulate some work based on the rank
       sleep(2*rank);

       // Register with the master
       MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
       MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
       MPI_Win_unlock(0, win);

       printf("Worker %d finished RMA\n", rank);

       // Release the window
       MPI_Win_free(&win);

       printf("Worker %d done\n", rank);
    }

    MPI_Finalize();
    return 0;
}

Sample output with 6 processes:

$ mpiexec --mca osc pt2pt -n 6 rma
[ 0.000] Schedule:      0       0       0       0       0       0
[ 1.995] Schedule:      0       1       0       0       0       0
Worker 1 finished RMA
[ 3.989] Schedule:      0       1       1       0       0       0
Worker 2 finished RMA
[ 5.988] Schedule:      0       1       1       1       0       0
Worker 3 finished RMA
[ 7.995] Schedule:      0       1       1       1       1       0
Worker 4 finished RMA
[ 9.988] Schedule:      0       1       1       1       1       1
All workers checked in using RMA
Worker 5 finished RMA
Worker 5 done
Worker 4 done
Worker 2 done
Worker 1 done
Worker 3 done
Master done

Question 2

The answer by Hristo Lliev works perfectly if I use newer versions of the Open-MPI library.

However, on the cluster we are currently using, this is not possible and for the older versions there was deadlock behavior for the final unlock calls, as described by Hhristo. Adding the options --mca osc pt2pt did solve the deadlock in a sense but the MPI_Win_unlock calls still didn't seem to complete until the process owning the accessed variable did its own lock/unlock of the window. This is not very useful when you have jobs with very different completion times.

Therefore from a pragmatical point of view, though strictly speaking leaving the topic of passive RMA synchronization (for which I do apologize), I would like to point out a workaround which makes use of external files for those who are stuck with using old versions of the Open-MPI library so they don't have to loose so much time as I did:

You basically create an external file containing the information about which (slave) process does which job, instead of an internal array. This way, you don't even have to have a master process only dedicated to the bookkeeping of the slaves: It can also perform a job. Anyway, every process can go look in this file which job is to be done next and possibly determine that everything is done.

The important point is now that this information file is not accessed at the same time by multiple processes, as this might cause work to be duplicated or worse. The equivalent of the locking and unlocking of the window in MPI is here imitated easiest by using a locking file: This file is created by the process currently accessing the information file. The other processes have to wait for the current process to finish by checking with a slight time delay whether the lock file still exists.

The full information can be found here.