Question

I've had a bug in my code for some time now and haven't been able to figure out how to fix it.

What I'm trying to achieve is simple enough: every worker node (i.e. every node with rank != 0) gets a row (represented by a 1-dimensional array) of a square structure and performs some computation on it. Once the computation is done, the row is sent back to the master.

For testing purposes, there is no computation involved. All that's happening is:

  • master sends a row number to a worker; the worker uses the row number to calculate the corresponding values
  • the worker sends the array with the result values back

Now, my issue is this:

  • everything works as expected up to a certain number of elements per row (size = 1006) with any number of workers > 1
  • if a row has more than 1006 elements, the workers fail to shut down and the program does not terminate
  • this only happens if I try to send the array back to the master. If I simply send back an INT, everything is fine (see the commented-out lines in doMasterTasks() and doWorkerTasks())

Based on the last bullet point, I assume there must be some race condition that only surfaces once the array sent back to the master reaches a certain size.

Do you have any idea what the issue could be?

Compile the code (saved, e.g., as simple.c) with: mpicc -O2 -std=c99 simple.c -o simple

Run the executable like so: mpirun -np 3 simple <size> (e.g. 1006 or 1007)

Here's the code:

#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define MASTER_RANK 0
#define TAG_RESULT 1
#define TAG_ROW 2
#define TAG_FINISHOFF 3

int mpi_call_result, my_rank, dimension, np;

// forward declarations
void doInitWork(int argc, char **argv);
void doMasterTasks(int argc, char **argv);
void doWorkerTasks(void);
void finalize();
void quit(const char *msg, int mpi_call_result);

void shutdownWorkers() {
    printf("All work has been done, shutting down clients now.\n");
    for (int i = 0; i < np; i++) {
        MPI_Send(0, 0, MPI_INT, i, TAG_FINISHOFF, MPI_COMM_WORLD);
    }
}

void doMasterTasks(int argc, char **argv) {
    printf("Starting to distribute work...\n");
    int size = dimension;
    int * dataBuffer = (int *) malloc(sizeof(int) * size);

    int currentRow = 0;
    int receivedRow = -1;
    int rowsLeft = dimension;
    MPI_Status status;

    for (int i = 1; i < np; i++) {
        MPI_Send(&currentRow, 1, MPI_INT, i, TAG_ROW, MPI_COMM_WORLD);
        rowsLeft--;
        currentRow++;

    }

    for (;;) {
//        MPI_Recv(dataBuffer, size, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT, MPI_COMM_WORLD, &status);
        MPI_Recv(&receivedRow, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

        if (rowsLeft == 0)
            break;

        if (currentRow > 1004)
            printf("Sending row %d to worker %d\n", currentRow, status.MPI_SOURCE);
        MPI_Send(&currentRow, 1, MPI_INT, status.MPI_SOURCE, TAG_ROW, MPI_COMM_WORLD);
        rowsLeft--;
        currentRow++;
    }
    shutdownWorkers();
    free(dataBuffer);
}

void doWorkerTasks() {
    printf("Worker %d started\n", my_rank);

    // send the processed row back as the first element in the colours array.
    int size = dimension;
    int * data = (int *) malloc(sizeof(int) * size);
    memset(data, 0, sizeof(size));

    int processingRow = -1;
    MPI_Status status;

    for (;;) {

        MPI_Recv(&processingRow, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (status.MPI_TAG == TAG_FINISHOFF) {
            printf("Finish-OFF tag received!\n");
            break;
        } else {
//            MPI_Send(data, size, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
            MPI_Send(&processingRow, 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
        }
    }

    printf("Slave %d finished work\n", my_rank);
    free(data);
}

int main(int argc, char **argv) {


    if (argc == 2) {
        sscanf(argv[1], "%d", &dimension);
    } else {
        dimension = 1000;
    }

    doInitWork(argc, argv);

    if (my_rank == MASTER_RANK) {
        doMasterTasks(argc, argv);
    } else {
        doWorkerTasks();
    }
    finalize();
}

void quit(const char *msg, int mpi_call_result) {
    printf("\n%s\n", msg);
    MPI_Abort(MPI_COMM_WORLD, mpi_call_result);
    exit(mpi_call_result);
}

void finalize() {
    mpi_call_result = MPI_Finalize();
    if (mpi_call_result != 0) {
        quit("Finalizing the MPI system failed, aborting now...", mpi_call_result);
    }
}

void doInitWork(int argc, char **argv) {
    mpi_call_result = MPI_Init(&argc, &argv);
    if (mpi_call_result != 0) {
        quit("Error while initializing the system. Aborting now...\n", mpi_call_result);
    }
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
}

Any help is greatly appreciated!

Best, Chris


Solution

If you take a look at your doWorkerTasks, you see that each worker sends exactly as many data messages as it receives (and it receives one extra message to shut it down).

But your master code:

for (int i = 1; i < np; i++) {
    MPI_Send(&currentRow, 1, MPI_INT, i, TAG_ROW, MPI_COMM_WORLD);
    rowsLeft--;
    currentRow++;

}

for (;;) {
    MPI_Recv(dataBuffer, size, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT, MPI_COMM_WORLD, &status);

    if (rowsLeft == 0)
        break;

    MPI_Send(&currentRow, 1, MPI_INT, status.MPI_SOURCE, TAG_ROW, MPI_COMM_WORLD);
    rowsLeft--;
    currentRow++;
}

sends np-2 more data messages than it receives. In particular, the master only keeps receiving until it has nothing left to send, even though np-2 data messages should still be outstanding at that point. Changing the code to the following:

int rowsLeftToSend= dimension;
int rowsLeftToReceive = dimension;

for (int i = 1; i < np; i++) {
    MPI_Send(&currentRow, 1, MPI_INT, i, TAG_ROW, MPI_COMM_WORLD);
    rowsLeftToSend--;
    currentRow++;

}

while (rowsLeftToReceive > 0) {
    MPI_Recv(dataBuffer, size, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT, MPI_COMM_WORLD, &status);
    rowsLeftToReceive--;

    if (rowsLeftToSend> 0) {
        if (currentRow > 1004)
            printf("Sending row %d to worker %d\n", currentRow, status.MPI_SOURCE);
        MPI_Send(&currentRow, 1, MPI_INT, status.MPI_SOURCE, TAG_ROW, MPI_COMM_WORLD);
        rowsLeftToSend--;
        currentRow++;
    }
}

Now works.

Why the code doesn't deadlock for smaller message sizes (note that this is a deadlock, not a race condition; deadlock is one of the most common errors in distributed computing) is a subtle detail of how most MPI implementations work. Generally, MPI implementations just "shove" small messages down the pipe whether or not the receiver is ready for them, but larger messages (since they take more storage resources on the receiving end) need some handshaking between sender and receiver. (If you want to find out more, search for eager vs. rendezvous protocols.)

So in the small-message case (1006 ints or fewer here, and 1 int certainly qualifies), the worker nodes completed their sends whether or not the master was receiving. If the master had called MPI_Recv(), the messages would have been waiting already and the call would have returned immediately. But it didn't, so there were pending messages on the master's side; it just didn't matter. The master sent out its kill messages, and everyone exited.

But for larger messages, the remaining sends need the receiver participating to complete, and since the receiver never does, the remaining workers hang.

Note that even in the small-message case, where there was no deadlock, the code didn't work properly: the computed data was never collected.

Update: There was a similar problem in your shutdownWorkers:

void shutdownWorkers() {
    printf("All work has been done, shutting down clients now.\n");
    for (int i = 0; i < np; i++) {
        MPI_Send(0, 0, MPI_INT, i, TAG_FINISHOFF, MPI_COMM_WORLD);
    }
}

Here you are sending to all processes, including rank 0, the very process doing the sending. In principle, that MPI_Send should deadlock, as it is a blocking send and no matching receive has been posted. You could post a non-blocking receive beforehand to avoid this, but that's unnecessary -- rank 0 doesn't need to tell itself to finish. So just change the loop to

    for (int i = 1; i < np; i++)

tl;dr - your code deadlocked because the master wasn't receiving enough messages from the workers; it happened to work for small message sizes because of an implementation detail common to most MPI libraries.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow