Question

I installed Linpack on a 2-node cluster with Xeon processors. When I start Linpack with this command:

mpiexec -np 28 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64

sometimes Linpack starts and prints its output, and sometimes I only see the MPI rank mappings printed and then nothing after that. This looks like random behaviour to me, because I don't change anything between the calls and, as already mentioned, Linpack sometimes starts and sometimes doesn't. In top I can see that the xhpl_intel64 processes have been created and are heavily using the CPU, but iftop tells me that nothing is being sent between the nodes.
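For completeness, the hostfile referenced above just lists the nodes for MPICH's Hydra launcher. A minimal sketch of what /root/machines.HOSTS could look like, assuming 14 ranks per node to match -np 28 (the hostname:count format caps how many processes Hydra places on each node):

node-0:14
node-1:14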

I am using MPICH2 as the MPI implementation. This is my HPL.dat:

# cat HPL.dat 
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10000         Ns
1            # of NBs
250          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
14            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
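As a sanity check on the values above: P × Q must equal the rank count given to mpiexec (2 × 14 = 28 here), and the double-precision matrix needs roughly 8·N² bytes of memory across all ranks. A quick shell sketch using the numbers from this HPL.dat:

# P*Q has to match mpiexec -np; HPL's matrix occupies ~8*N^2 bytes overall.
N=10000; P=2; Q=14
echo "ranks needed: $(( P * Q ))"                        # 28
echo "matrix size: $(( 8 * N * N / 1024 / 1024 )) MiB"   # ~762 MiB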

Edit 2:

I now just let the program run for a while, and after 30 minutes it told me:

# mpiexec -np 32 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
(node-0:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
(node-1:16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31)
Assertion failed in file ../../socksm.c at line 2577: (it_plfd->revents & 0x008) == 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
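The assertion comes from MPICH2's TCP layer; 0x008 is POLLERR on Linux, so poll() flagged an error on one of the sockets between the ranks. A quick way to check whether high TCP ports are reachable between the nodes at all is netcat; the hostname and port below are only examples, and the -l syntax varies between netcat flavours:

# on node-1 (traditional netcat needs "nc -l -p 50000" instead):
nc -l 50000
# on node-0: -z only probes the port, -v prints the result
nc -zv node-1 50000

If the probe times out while both nodes are up, something between them is dropping the connection.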

Is this an MPI problem?

Do you know what kind of problem this could be?


Solution

I figured out what the problem was: MPICH2 picks a different set of random ports each time it starts, and if those ports are blocked, your application won't start up correctly. The solution for MPICH2 is to set the environment variable MPICH_PORT_RANGE to START:END, like this:

export MPICH_PORT_RANGE=50000:51000
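The variable has to be visible to the processes on both nodes. With the Hydra launcher you can also pass it per run instead of exporting it in every shell; a sketch reusing the command from the question:

mpiexec -genv MPICH_PORT_RANGE 50000:51000 -np 28 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64

And if a firewall between the nodes was what blocked the random ports, the pinned range has to be allowed there too; an iptables sketch, where the source subnet is only an assumption:

iptables -A INPUT -p tcp --dport 50000:51000 -s 192.168.0.0/24 -j ACCEPT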

Best, heinrich

Licensed under: CC-BY-SA with attribution