Question

I'm trying to parallelize an algorithm using PVM for a University assignment. I've got the algorithm sorted, but the parallelization only almost works: the process intermittently gets stuck for no apparent reason. I can see no pattern; a run with the same parameters might work 10 times and then get stuck on the next attempt.

None of the pvm functions (in the master or any child process) are returning any error codes, the children seem to complete successfully, no errors are reaching the console. It really does just look like the master isn't receiving every communication from the children - but only on occasional runs.
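For reference, a wrapper along these lines is one way to make sure no error code slips by unnoticed. It is only a sketch: `checked` is a hypothetical helper rather than part of PVM, and it relies on PVM's convention that a call reports failure by returning a negative value. The PVM calls themselves appear only in comments so that the snippet compiles without the PVM headers.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of PVM): passes a PVM return code through
// unchanged, but throws if it is negative, which is how PVM calls report
// failure.
inline int checked( int info , const char* what )
{
    if( info < 0 )
        throw std::runtime_error( std::string( what ) + " failed with code " + std::to_string( info ) );
    return info;
}

// Intended use (needs <pvm3.h>, so shown here only as comments):
//
//   checked( pvm_initsend( 0 ) , "pvm_initsend" );
//   checked( pvm_pkint( &number , 1 , 1 ) , "pvm_pkint" );
//   checked( pvm_send( tid , 0 ) , "pvm_send" );
```

Wrapping every call this way turns a silently ignored negative return value into an exception that names the failing call.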

Oddly, though, I don't think it's just skipping a message. I've yet to see a result go missing from a child that then successfully sent its completion signal (that is to say, no run has reached completion and returned an unexpected result). It's as though the child just becomes disconnected, and all of its messages from a certain point on cease arriving.

Batching the results up and sending fewer, but larger, messages seems to improve reliability; at least it feels like it's sticking less often, though I don't have hard numbers to back this up.

Is it normal, common or expected that PVM will lose messages sent via pvm_send and its friends? Please note the error occurs whether all processes run on a single host or across multiple hosts.

Am I doing something wrong? Is there something I can do to help prevent this?

Update

I've reproduced the error on a very simple test case, code below, which just spawns four children and sends a single number to each; each child multiplies the number it receives by ten and sends it back. It works almost all the time, but occasionally it freezes with only three numbers printed out, with one child's result missing (even though said child will have completed).

Master:

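#include <iostream>
#include <pvm3.h>

// Assumed definition: the test case spawns four children.
namespace global { const int taskCount = 4; }

int main()
{
    // Start the PVM daemon (argc = 0 , argv = NULL , block = 0).
    pvm_start_pvmd( 0 , NULL , 0 );

    // Spawn the children; their task IDs are written into taskIDs.
    // pvm_spawn takes a non-const char*, hence the cast on the literal.
    int taskIDs[global::taskCount];
    pvm_spawn( const_cast<char*>( "/path/to/pvmtest/child" ) , NULL , 0 , NULL , global::taskCount , taskIDs );

    // Send one number to each child.
    int numbers[global::taskCount] = { 5 , 10 , 15 , 20 };
    for( int i=0 ; i<global::taskCount ; ++i )
    {
        pvm_initsend( 0 );
        pvm_pkint( &numbers[i] , 1 , 1 );
        pvm_send( taskIDs[i] , 0 );
    }

    // Collect one reply per child, from any task and with any message tag.
    int received;
    for( int i=0 ; i<global::taskCount ; ++i )
    {
        pvm_recv( -1 , -1 );
        pvm_upkint( &received , 1 , 1 );
        std::cout << received << std::endl;
    }

    pvm_halt();
}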

Child:

#include <pvm3.h>

int main()
{
    int number;

    pvm_recv( -1 , -1 );
    pvm_upkint( &number , 1 , 1 );

    number *= 10;

    pvm_initsend( 0 );
    pvm_pkint( &number , 1 , 1 );
    pvm_send( pvm_parent() , 0 );
}

Solution

Not really an answer, but two things have changed together and the problem seems to have subsided:

  1. I added a call to pvm_exit() at the end of the child (slave) binary, which is apparently the recommended practice.

  2. The configuration of PVM over the cluster changed ... somehow ... I don't have any specifics, but a few nodes that were previously unable to take part in PVM operations now can. Other things may have changed as well.

I suspect that something in the second change is what actually fixed my problem.
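For the first point, one way to make sure the cleanup call runs on every exit path of the child is a small scope guard. This is only a sketch and not the code I actually used: `ScopeExit` and `makeScopeExit` are hypothetical helpers, not part of PVM, and the real pvm_exit() call appears only in a comment so that the snippet compiles without the PVM headers.

```cpp
#include <utility>

// Hypothetical helper (not part of PVM): runs a cleanup callable when the
// guard goes out of scope, whether by normal return or by exception.
template <typename F>
class ScopeExit
{
public:
    explicit ScopeExit( F f ) : f_( std::move( f ) ) , active_( true ) {}

    // Moving transfers responsibility for the cleanup to the new guard.
    ScopeExit( ScopeExit&& other ) : f_( std::move( other.f_ ) ) , active_( other.active_ )
    {
        other.active_ = false;
    }

    ~ScopeExit() { if( active_ ) f_(); }

    ScopeExit( const ScopeExit& ) = delete;
    ScopeExit& operator=( const ScopeExit& ) = delete;

private:
    F f_;
    bool active_;
};

// Factory so the callable's type can be deduced (works in C++11).
template <typename F>
ScopeExit<F> makeScopeExit( F f )
{
    return ScopeExit<F>( std::move( f ) );
}

// Intended use in the child (needs <pvm3.h>, so shown only as comments):
//
//   int main()
//   {
//       auto leavePvm = makeScopeExit( [] { pvm_exit(); } );
//       ... pvm_recv / pvm_upkint / pvm_initsend / pvm_pkint / pvm_send ...
//   }   // pvm_exit() runs here, after the result has been sent
```

Simply calling pvm_exit() as the last statement of main() does the same job for a child this small; the guard only earns its keep once there are early returns or exceptions.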

Licensed under: CC-BY-SA with attribution