Problem

I am trying to set up an MPI cluster, but I have the problem that the number of CPUs listed in the mpd.hosts file is not used correctly. I have three Ubuntu servers:

opteron with 48 cores
calc1 with 8 cores
calc2 with 8 cores

My mpd.hosts looks like:
opteron:46
calc1:6
calc2:6

After booting (mpdboot -n 3 -f mpd.hosts), the system is running, and mpdtrace lists all three machines.

But running a program like "mpiexec -n 58 raxmlHPC-MPI ..." (46 + 6 + 6 = 58 slots in mpd.hosts) causes calc1 and calc2 to get too many jobs while opteron gets too few. What am I doing wrong?

Regards

Bjoern


Solution

I found a workaround: I passed the additional parameter "-machinefile /path/to/mpd.hosts" to the mpiexec command, and now all nodes are used correctly. One problem remained: I got the following error message:

... MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or directory ...

To fix it, I had to set the environment variable MPICH_NO_LOCAL=1.
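
Putting the two pieces together, the launch looks roughly like this (a sketch only; the path to mpd.hosts and the raxmlHPC-MPI arguments are placeholders taken from the question):

# work around the shared-memory "open failed" error by disabling intranode shortcuts
export MPICH_NO_LOCAL=1
# pass the machinefile explicitly so the per-host process counts are honored
mpiexec -machinefile /path/to/mpd.hosts -n 58 raxmlHPC-MPI ...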

Other tips

As you figured out, you must pass the machinefile to both mpdboot and mpiexec in order to use per-host process counts. The "open failed" issue is a known bug in MPD, the process manager you are using. Note that the MPICH_NO_LOCAL=1 workaround will work, but will probably result in a big performance penalty for intranode communication.

You are clearly using MPICH2 (or an MPICH2 derivative), but it's not clear what version you are using. If you can, I would strongly recommend upgrading to either MPICH2 1.2.1p1 or (better yet) 1.3.1. Both of these releases include a newer process manager called hydra that is much faster and more robust. In 1.3.1, hydra is the default process manager. It doesn't require an mpdboot phase, and it supports a $HYDRA_HOST_FILE environment variable so that you don't have to specify the machine file on every mpiexec.
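For illustration, a minimal sketch of what a hydra-based launch might look like under MPICH2 1.3.1, assuming the host file reuses the same hostname:count format as the mpd.hosts above (the path is a placeholder):

# no mpdboot phase is needed with hydra
export HYDRA_HOST_FILE=/path/to/hosts
# hydra reads the host list (and per-host counts) from the environment variable
mpiexec -n 58 raxmlHPC-MPI ...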
