Question

I'm currently optimizing parts of my code and am therefore doing some benchmarking.

I have NxN matrices A and T and want to multiply them elementwise and store the result in A again, i.e. A = A*T. Since the array assignment itself cannot be parallelised directly with OMP PARALLEL DO, I expanded it into

!$OMP PARALLEL DO
do j = 1, N
    do i = 1, N
        A(i,j) = T(i,j) * A(i,j)
    end do
end do
!$OMP END PARALLEL DO

(Full minimal working example at http://pastebin.com/RGpwp2KZ.)

The strange thing is that, regardless of the number of threads (between 1 and 4), the execution (wall-clock) time stays more or less the same (±10%), while the CPU time increases with the number of threads. That made me think that all the threads do the same work (because I made a mistake with OpenMP) and therefore need the same time.
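
For reference, a stripped-down sketch of such a benchmark is shown below. This is not the pastebin code; the double complex element type and N = 729 are taken from the solution below, and wall-clock vs. CPU time are read with omp_get_wtime and cpu_time respectively. Compile with gfortran -fopenmp.

program elementwise_bench
    use omp_lib
    implicit none
    integer, parameter :: N = 729          ! size and element type taken from the solution below
    double complex, allocatable :: A(:,:), T(:,:)
    double precision :: wall0, wall1, cpu0, cpu1
    integer :: i, j

    allocate(A(N,N), T(N,N))
    A = (1.0d0, 1.0d0)
    T = (2.0d0, 0.5d0)

    call cpu_time(cpu0)                    ! process CPU time (with gfortran this typically includes all threads)
    wall0 = omp_get_wtime()                ! wall-clock time

    !$OMP PARALLEL DO
    do j = 1, N
        do i = 1, N
            A(i,j) = T(i,j) * A(i,j)
        end do
    end do
    !$OMP END PARALLEL DO

    wall1 = omp_get_wtime()
    call cpu_time(cpu1)

    print *, 'wall-clock time [s]:', wall1 - wall0
    print *, 'CPU time        [s]:', cpu1 - cpu0
end program elementwise_bench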

But on another computer (with 96 CPU cores available) the program behaves as expected: with an increasing number of threads the execution time decreases. Surprisingly, the CPU time decreases as well (up to ~10 threads, then rises again).

It might be that different versions of OpenMP or gfortran are installed. If this could be the cause, it would be great if you could tell me how to find that out.
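
For what it's worth, the compiler version can be printed with gfortran --version, and the OpenMP specification level the compiler implements can be read from the _OPENMP macro. A small sketch, assuming gfortran and a preprocessed source file (e.g. using the .F90 suffix):

! check_omp.F90 -- build with: gfortran -fopenmp check_omp.F90
! (the .F90 suffix enables the preprocessor so that _OPENMP is expanded)
program check_omp
    implicit none
#ifdef _OPENMP
    ! _OPENMP holds the release date of the supported spec,
    ! e.g. 200805 = OpenMP 3.0, 201107 = OpenMP 3.1, 201307 = OpenMP 4.0
    print *, 'OpenMP spec date:', _OPENMP
#else
    print *, 'compiled without OpenMP support'
#endif
end program check_omp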

Solution

You could in theory make Fortran array operations parallel by using the Fortran-specific OpenMP WORKSHARE directive:

!$OMP PARALLEL WORKSHARE
A(:,:) = T(:,:) * A(:,:)
!$OMP END PARALLEL WORKSHARE

Note that although this is quite standard OpenMP code, some compilers, most notably the Intel Fortran Compiler (ifort), implement the WORKSHARE construct simply by means of a SINGLE construct and therefore give no parallel speed-up whatsoever. gfortran, on the other hand, converts this code fragment into an implicit PARALLEL DO loop. Note also that gfortran won't parallelise the standard array notation A = T * A inside the worksharing construct unless it is written explicitly as A(:,:) = T(:,:) * A(:,:).
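
For completeness, a self-contained sketch that exercises the construct (compile with gfortran -fopenmp; the matrix size and double complex element type match the benchmark discussed below):

program workshare_sketch
    implicit none
    integer, parameter :: N = 729
    double complex, allocatable :: A(:,:), T(:,:)

    allocate(A(N,N), T(N,N))
    A = (1.0d0, 1.0d0)
    T = (2.0d0, 0.5d0)

    ! gfortran turns the array assignment into an implicit PARALLEL DO;
    ! ifort executes it inside a SINGLE construct, i.e. serially
    !$OMP PARALLEL WORKSHARE
    A(:,:) = T(:,:) * A(:,:)
    !$OMP END PARALLEL WORKSHARE

    print *, 'A(1,1) =', A(1,1)
end program workshare_sketch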

Now about the performance and the lack of speed-up. Each column of your A and T matrices occupies (2 * 8) * 729 = 11664 bytes. One matrix occupies 8.1 MiB and the two matrices together occupy 16.2 MiB. This probably exceeds the last-level cache size of your CPU.

The multiplication code also has very low computational intensity: each iteration fetches 32 bytes of data from memory, performs one complex multiplication in 6 FLOPs (4 real multiplications, 1 addition and 1 subtraction), and stores 16 bytes back to memory, which results in (6 FLOP)/(48 bytes) = 1/8 FLOP/byte. If the memory is considered full duplex, i.e. it supports writing while being read, the intensity goes up to (6 FLOP)/(32 bytes) = 3/16 FLOP/byte.

It follows that the problem is memory bound and even a single CPU core might be able to saturate all the available memory bandwidth. For example, a typical x86 core can retire two floating-point operations per cycle, so at 2 GHz it could deliver 4 GFLOP/s of scalar math. To keep such a core busy running your multiplication loop, main memory would have to provide (4 GFLOP/s) * (16/3 byte/FLOP) = 21.3 GiB/s. This more or less exceeds the real memory bandwidth of current-generation x86 CPUs. And this is only for a single core with non-vectorised code. Adding more cores and threads does not increase the performance, since the memory simply cannot deliver data fast enough to keep the cores busy; rather, performance suffers because more threads add more overhead.

When run on a multisocket system like the one with 96 cores, the program gets access to more last-level cache and to higher aggregate memory bandwidth (assuming a NUMA system with a separate memory controller in each CPU socket), so performance increases, but only because there are more sockets, not because there are more cores.
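
The back-of-the-envelope numbers above can be reproduced with a few lines (purely illustrative; the 2 GHz, 2 FLOP/cycle core is the same hypothetical one used in the estimate):

program intensity_estimate
    implicit none
    integer, parameter :: N = 729
    ! each double complex element occupies 16 bytes
    print *, 'bytes per matrix column   :', 16 * N                      ! 11664
    print *, 'MiB per matrix            :', 16.0 * N * N / 1024.0**2    ! ~8.1
    print *, 'FLOP/byte (read + write)  :', 6.0 / 48.0                  ! 1/8
    print *, 'FLOP/byte (full duplex)   :', 6.0 / 32.0                  ! 3/16
    ! bandwidth needed to feed a hypothetical 2 GHz x 2 FLOP/cycle scalar core
    print *, 'required bandwidth [GB/s] :', 4.0 * 32.0 / 6.0            ! ~21.3
end program intensity_estimate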

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow