Question

I have some Python code in which I very often spawn multiple processes. I get this error:

ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached in file odls_default_module.c at line 809

My code looks roughly like this:

import sys
from mpi4py import MPI
comm = MPI.COMM_WORLD
...
icomm = MPI.COMM_SELF.Spawn(sys.executable,args=["front_process.py",str(rank)],maxprocs=no_fronts)
...
message = icomm.recv(source=MPI.ANY_SOURCE,tag=21)
...
icomm.Free()

Spawn is called very often, and I think the spawned processes remain "open" after I am finished with them, despite the call to icomm.Free(). How do I properly "close" a spawned process?


Solution

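The MPI specification for MPI_COMM_FREE states that "... the object is actually deallocated only if there are no other active references to it." Freeing the intercommunicator therefore does not by itself sever the connection to the spawned processes. You can disconnect processes by calling MPI_COMM_DISCONNECT on both ends of all intercommunicators that link them. In mpi4py the equivalent call is icomm.Disconnect() in the parent; in each spawned process, call Disconnect() on the communicator returned by MPI.Comm.Get_parent().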
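As a minimal sketch of the disconnect pattern described above, the helper below wraps Spawn in a context manager so that Disconnect() is always called on the parent side. The name spawned and its parameters are hypothetical, not part of mpi4py; only Spawn(), Disconnect() and MPI.Comm.Get_parent() are mpi4py calls. The communicator is passed in as an argument, so the helper itself does not import MPI.

```python
import sys
from contextlib import contextmanager


@contextmanager
def spawned(parent_comm, script, script_args, maxprocs):
    """Spawn Python workers and always disconnect from them on exit.

    parent_comm is assumed to be an mpi4py intracommunicator such as
    MPI.COMM_SELF. The helper name and parameters are illustrative only.
    """
    icomm = parent_comm.Spawn(sys.executable,
                              args=[script, *script_args],
                              maxprocs=maxprocs)
    try:
        yield icomm
    finally:
        # Disconnect is collective: every spawned process must also call
        # Disconnect() on the communicator from MPI.Comm.Get_parent().
        icomm.Disconnect()


# Parent side (hypothetical usage, mirroring the question's code):
#     from mpi4py import MPI
#     with spawned(MPI.COMM_SELF, "front_process.py", [str(rank)], no_fronts) as icomm:
#         message = icomm.recv(source=MPI.ANY_SOURCE, tag=21)
#
# Child side, at the end of front_process.py:
#     MPI.Comm.Get_parent().Disconnect()
```

Because the call is collective, the parent blocks in Disconnect() until the children have disconnected as well, so the child-side call is not optional.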

Still, the error that you see probably comes from orterun (symlinked as mpirun and mpiexec) and not from the master rank. orterun is the program that launches all MPI processes (the initial ones and those spawned later) and then redirects their standard output to its own standard output so that you can see the output from each rank. When processes are started on the local host, orterun uses a simple fork()/exec() mechanism, part of the odls framework, to spawn new ranks, and it uses pipes both to detect a successful launch and to forward IO. The launch detection pipes are open only for a very short period of time, but the IO forwarding pipes remain open for as long as the rank is running. If many ranks are running at the same time, many pipes stay open in orterun, hence the error message.
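To illustrate how open pipes exhaust a per-process descriptor limit, here is a small Unix-only sketch. It is not Open MPI's code: it temporarily lowers the soft limit of the current Python process and opens pipes with os.pipe() until the kernel refuses with EMFILE. The function name and the default limit of 64 are assumptions made for the demonstration.

```python
import errno
import os
import resource


def count_pipes_until_limit(soft_limit=64):
    """Open pipes until EMFILE is raised; return how many were opened."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, hard)
    # Lower the soft limit only for the duration of the experiment.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    fds = []
    try:
        while True:
            try:
                fds.extend(os.pipe())  # each pipe uses two descriptors
            except OSError as exc:
                if exc.errno != errno.EMFILE:
                    raise
                return len(fds) // 2
    finally:
        for fd in fds:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
```

Each pipe costs two descriptors in this sketch, so the number of pipes that fit is well under the limit itself once standard input, output and error are counted.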

The error message is a bit misleading, since there are two cases of "too many descriptors" and Open MPI does not distinguish between them. The first case is when the system-wide kernel limit is reached, but this is usually a huge value. The second case is when the per-process limit on the number of file descriptors is reached. The latter can be controlled with the ulimit command. You should check the value in your case with ulimit -n and increase it if necessary. For example:

user@host$ ulimit -n 123456
user@host$ mpiexec -n 1 ... ./spawning_code.py arg1 arg2 ...

Here 123456 is the desired limit on the number of descriptors. It cannot exceed the hard limit, which can be obtained with ulimit -Hn. If you are running your program from a script (either for convenience or because you submit jobs to a batch queueing system), put the ulimit -n line in the script before the call to mpirun/mpiexec, so that the launcher inherits the raised limit.
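If the launch script is itself written in Python, the same idea can be sketched with the standard resource module: raise the soft limit in the launcher and then start mpiexec as a child, which inherits the limit. The function names below are hypothetical, and the sketch assumes a Unix system. Raising the limit inside the spawning MPI program itself would not help, because orterun is its parent and does not inherit from it.

```python
import resource
import subprocess


def raise_nofile_limit(target):
    """Raise the soft descriptor limit towards target, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft = target
    return soft


def launch_with_limit(command, target):
    """Raise the limit, then run command; the child inherits the new limit."""
    raise_nofile_limit(target)
    return subprocess.run(command, check=True)


# Hypothetical usage, mirroring the shell example:
#     launch_with_limit(["mpiexec", "-n", "1", "./spawning_code.py"], 123456)
```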

Note that in the text above, the words "rank" and "process" refer to the same thing.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow