Question

I'm currently parallelizing (using MPICH) an existing, old Fortran program that performs a data inversion for a single pixel of an image, hence the need for parallelization. My strategy is to write a new main routine that performs all I/O, passes the data for one pixel to each of the nodes, and calls the existing program (which I converted to a subroutine). The nodes do their calculations independently of each other and pass the results back to the main routine.
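
A minimal sketch of the layout described above, assuming a static round-robin assignment of pixels to worker ranks. The names `pixel_farm`, `invert_pixel`, `npix`, and `nval` are hypothetical placeholders, and the `sqrt` call merely stands in for the converted legacy routine; this is not the actual program.

```fortran
program pixel_farm
  use mpi
  implicit none
  integer, parameter :: npix = 16   ! hypothetical number of pixels
  integer, parameter :: nval = 4    ! hypothetical number of values per pixel
  integer :: ierr, rank, nprocs, nworkers, ipix, first, last
  integer :: status(MPI_STATUS_SIZE)
  double precision :: pixel(nval), res(nval)
  double precision, allocatable :: image(:,:), results(:,:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  nworkers = nprocs - 1
  if (nworkers < 1) then
    print *, 'run with at least 2 processes'
    call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  if (rank == 0) then
    ! Root: does all I/O (here replaced by random data) and collects results.
    allocate(image(nval, npix), results(nval, npix))
    call random_number(image)
    ! Work in rounds: one pixel per worker, then collect that round's results.
    do first = 1, npix, nworkers
      last = min(first + nworkers - 1, npix)
      do ipix = first, last
        call MPI_Send(image(:, ipix), nval, MPI_DOUBLE_PRECISION, &
                      ipix - first + 1, ipix, MPI_COMM_WORLD, ierr)
      end do
      do ipix = first, last
        call MPI_Recv(results(:, ipix), nval, MPI_DOUBLE_PRECISION, &
                      ipix - first + 1, ipix, MPI_COMM_WORLD, status, ierr)
      end do
    end do
    print *, 'processed', npix, 'pixels'
  else
    ! Worker: handles pixels rank, rank + nworkers, rank + 2*nworkers, ...
    do ipix = rank, npix, nworkers
      call MPI_Recv(pixel, nval, MPI_DOUBLE_PRECISION, 0, ipix, &
                    MPI_COMM_WORLD, status, ierr)
      call invert_pixel(pixel, res)
      call MPI_Send(res, nval, MPI_DOUBLE_PRECISION, 0, ipix, &
                    MPI_COMM_WORLD, ierr)
    end do
  end if

  call MPI_Finalize(ierr)

contains

  ! Hypothetical stand-in for the legacy inversion routine.
  subroutine invert_pixel(pin, pout)
    double precision, intent(in)  :: pin(:)
    double precision, intent(out) :: pout(:)
    pout = sqrt(pin)
  end subroutine invert_pixel

end program pixel_farm
```

Under such a round-robin assignment each worker sees every `nworkers`-th pixel, so a fault tied to one process would recur with that period.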

After running it on our cluster I discovered strange problems: on some nodes the calculations go wrong because at some point NaN values appear in some variables (which does not happen in the sequential version). Once this has happened for a first pixel (more or less at random, but depending on the computer I run it on and on the compiler options), the behavior repeats periodically, with a period equal to the number of nodes I am using. Because it also happens on another computer, it is not caused by broken CPUs.

I am using MPICH 3.0.4, ifort 11.1 and 12.05 (I tried it with both versions) with

-heap_arrays -O2 -save -w -fp-model precise -fp-model source -mcmodel=medium -extend_source -shared-intel

My general question: is it possible that the memory of the nodes is somehow getting mixed up, such that NaNs appear in strange places in memory (this would not explain the periodic behavior, though)? Or what else could lead to memory leaks (I can exclude the numerics)? My root node does allocate huge arrays (approx. 2 GB in total), but on the cluster I have 128 GB of RAM available per node, and the slave nodes use far less memory!

The next problem: I am not able to debug the program properly. Using idb I only find out that NaNs suddenly turn up; because of the complex numerics and the many loops in the old routines I cannot trace them any further. I cannot use valgrind either, because a known bug makes it impossible to use with large executables (mine is ~5 MB). Intel inspxe just tells me that I have a memory leak (at the line where the subroutine is defined).

Any suggestions? Cheers, stefan

Solution

Thanks for your help. Tracking the NaNs using -fpe0 led me in the right direction: the problem was connected to floating-point precision. I had always used -fp-model precise -fp-model source, believing that this would give me the highest precision. It turned out that using -fltconsistency instead solved all my problems!

No more errors.
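
Besides trapping floating-point exceptions at the compiler level, NaNs can be localized by checking intermediate arrays explicitly. The following is only a sketch, assuming the compiler provides the Fortran 2003 `ieee_arithmetic` module; `report_nans` and the label string are hypothetical names, and the NaN is injected artificially for demonstration.

```fortran
program nan_check
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none
  double precision :: res(4)

  res = [1.0d0, 2.0d0, 3.0d0, 4.0d0]
  ! Inject a quiet NaN so the check below has something to find.
  res(3) = ieee_value(res(3), ieee_quiet_nan)

  call report_nans('after inversion step', res)

contains

  ! Print the index of every NaN entry in x, prefixed with a label
  ! that identifies where in the computation the check was made.
  subroutine report_nans(label, x)
    character(len=*), intent(in) :: label
    double precision, intent(in) :: x(:)
    integer :: i
    do i = 1, size(x)
      if (ieee_is_nan(x(i))) then
        print '(a,a,i0)', trim(label), ': NaN at index ', i
      end if
    end do
  end subroutine report_nans

end program nan_check
```

Calling such a check after each major step of the legacy routine narrows down the loop in which the first NaN is produced.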

Licensed under: CC-BY-SA with attribution