The choice of the correct parallelization configuration for a real application code is never trivial. The optimal mapping of MPI processes and OpenMP threads onto a multiprocessor node depends on the specific implementation of the algorithm, the OpenMP runtime, the internal organization of the cache memory hierarchy and other factors related to the processor architecture.
Therefore, users are advised to run different configurations on their specific hardware to find the optimal assignment. Reports on such studies can be found among the technical reports of research computing facilities and HPC consultancies.
On an m x n node, where m is the number of processor sockets and n is the number of CPU cores per socket, such an experiment would involve running the code for all integral values of the number of MPI processes p and OpenMP threads q such that p x q = m x n, for each available compiler.
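The sweep described above can be sketched as a short shell script that enumerates every valid (p, q) factorization of the core count and prints the corresponding launch command. The binary name `./app`, the `mpirun` invocation, and the 4 x 12 = 48 core total are placeholder assumptions; adapt them to your application, MPI launcher, and batch scheduler.

```shell
# sweep_cmds prints one hybrid launch command per factorization
# p * q = total, where p is the MPI process count and q the
# OpenMP thread count. ./app is a placeholder binary name.
sweep_cmds() {
  total=$1
  p=1
  while [ "$p" -le "$total" ]; do
    if [ $((total % p)) -eq 0 ]; then
      echo "OMP_NUM_THREADS=$((total / p)) mpirun -np $p ./app"
    fi
    p=$((p + 1))
  done
}

# Example: a 4-socket, 12-cores-per-socket node has 48 cores.
sweep_cmds 48
```

In a real experiment each printed line would be submitted as a batch job (once per compiler), with process and thread binding flags added so that placement on sockets is controlled rather than left to the OS scheduler.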
Here is a plot of the parallel speedup obtained for different combinations of p and q on a 4 x 12 AMD Opteron node. Data are taken from HiPERiSM Consulting, LLC technical report HCTR-2011-2 by George Delic, 2010.
You can see that for this particular code and processor architecture the optimal number of OpenMP threads per MPI process is 1. However, the case of 12 MPI processes with 4 threads each came a close second.