Question

I'm trying to get a UPC-NAS Benchmark (compiled for 256 threads) running on a cluster of 32 nodes. When I run it, rsh connections are established for 247 threads, and then it terminates with the following error:

p0_11350:  p4_error: Child process exited while making connection to remote process on dell16: 0
506 rm_l_237_24446: (26.785156) net_send: corm_11947: (215.339844) net_srm_l_1rm_24412: (26.785156) net_send: could not write to fd=4, errnrrrm_l_127_5013: (121.984375) net_send: could not w    rite to fd=5, errno = 32

Can anybody point out where the problem lies?

It runs fine with fewer threads, such as 64 or 128.


Solution

Errno 32 is EPIPE (#define EPIPE 32 /* Broken pipe */).
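
To see what that errno means in practice, here is a minimal Python sketch (not p4 code) that provokes the same error locally by writing to a pipe whose reading end is already closed. The helper name provoke_epipe is made up for this illustration. In the failing run the "pipe" is a socket to a remote process that has gone away, but the errno is the same.

```python
import errno
import os


def provoke_epipe():
    """Write to a pipe with no reader and return the errno that results.

    provoke_epipe is a hypothetical helper for illustration only.
    """
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the "remote side" goes away
    try:
        os.write(write_fd, b"x")
    except BrokenPipeError as exc:
        # Python ignores SIGPIPE by default, so the failed write
        # surfaces as an exception carrying errno EPIPE.
        return exc.errno
    finally:
        os.close(write_fd)
    return None


if __name__ == "__main__":
    code = provoke_epipe()
    print(code, errno.errorcode.get(code), os.strerror(code))
```

So "could not write to fd=4, errno = 32" only says that the peer had already closed the connection. It does not say why the peer died.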

I suspect that some file descriptor limit is being hit (check ulimit -a). It could also be a network limit or a network failure.
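
As a rough sketch of the descriptor check, the following Python snippet reads the same per-process limit that ulimit -n reports and compares it with a descriptor count you supply. The function name fd_limit_report and the figure of 300 are assumptions for illustration; how many descriptors p4 really needs per process is not something I know.

```python
import resource


def fd_limit_report(needed):
    """Compare a guessed descriptor count against RLIMIT_NOFILE.

    fd_limit_report and `needed` are hypothetical; `needed` is only a
    guess at how many descriptors one process would have open.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = soft == resource.RLIM_INFINITY
    return {
        "soft": soft,  # limit currently enforced on this process
        "hard": hard,  # ceiling the soft limit can be raised to
        "fits": unlimited or needed <= soft,
    }


if __name__ == "__main__":
    # 300 is an arbitrary example value, not a measured p4 requirement.
    print(fd_limit_report(300))
```

Run it on the compute nodes as well as on the launch node, since the limits seen by processes started through rsh can differ from those of your interactive shell.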

I should also mention that p4 is ancient, so this could be some internal limit. Development of p4 stopped more than 15 years ago. It is very stable code, in the Debian Stable sense.

So why are you using mpich1? Can you move to the less ancient mpich2?

Licensed under: CC-BY-SA with attribution