Calling multithreaded MKL from an OpenMP parallel region

Question 1

This is a case of nested parallelism. It is supported by MKL, but it only works if your executable is built using the Intel C/C++ compiler. The reason for that restriction is that MKL uses Intel's OpenMP runtime and that different OMP runtimes do not play well with each other.

Once that is sorted out, you should enable nested parallelism by setting OMP_NESTED to TRUE and disable MKL's detection of nested parallelism by setting MKL_DYNAMIC to FALSE. If the data to be processed with dgemm_ is shared, then you have to invoke the latter from within a single construct. If each thread processes its own private data, then you don't need any synchronisation constructs, but using multithreaded MKL won't give you any benefit either. Therefore I would assume that your case is the former.

To summarise:

#pragma omp single
dgemm_(...);

and run with:

$ MKL_DYNAMIC=FALSE MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 OMP_NESTED=TRUE ./exe

You could also set the parameters with the appropriate calls:

mkl_set_dynamic(0);
mkl_set_num_threads(8);
omp_set_nested(1);

#pragma omp parallel num_threads(8) ...
{
   ...
}

though I would prefer to use environment variables instead.
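
As a minimal sketch of the single-construct pattern, the following C code calls a matrix multiply from inside a parallel region. It does not link against MKL: small_dgemm is a hypothetical stand-in for dgemm_ whose inner parallel loop plays the role of MKL's internal threading, and fill_and_multiply and the fill values are made up for illustration. It uses omp_set_max_active_levels(2) where the calls above use omp_set_nested(1), on the assumption that your runtime follows OpenMP 5.0, which deprecates the latter.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for dgemm_: C = A * B for n x n row-major matrices.
   The inner parallel loop mimics a library that threads internally. */
static void small_dgemm(int n, const double *a, const double *b, double *c)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Fill the shared inputs in parallel, then let exactly one thread
   of the outer team perform the multiply. */
static void fill_and_multiply(int n, double *a, double *b, double *c)
{
    assert(n > 0);
#ifdef _OPENMP
    /* Allow one nested level, comparable to OMP_NESTED=TRUE. */
    omp_set_max_active_levels(2);
#endif
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n * n; i++) {
            a[i] = 1.0;
            b[i] = 2.0;
        }
        /* Implicit barrier of the loop above: a and b are complete here. */

        #pragma omp single
        small_dgemm(n, a, b, c);
        /* Implicit barrier at the end of single: c is complete here. */
    }
}
```

With real MKL you would replace small_dgemm with the dgemm_ call and keep the environment variables or mkl_set_* calls shown above.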

Question 2

While this post is a bit dated, I would still like to give some useful insights for it.

The above answer is correct from a functional perspective, but will not give the best results from a performance perspective. The reason is that most OpenMP implementations do not shut down the threads when they reach a barrier or don't have work to do. Instead, the threads will enter a spin-wait loop and continue to consume processor cycles while they are waiting.

In the example:

#pragma omp parallel
{
    #pragma omp for nowait
    for(...) {}  // first loop

    #pragma omp for
    for(...) {}  // second loop

    #pragma omp single
    dgemm_(...);

    #pragma omp for
    for(...) {}  // third loop
}

What will happen is that even if the dgemm call creates additional threads inside MKL, the outer-level threads will still be actively waiting for the end of the single construct and thus dgemm will run with reduced performance.

There are essentially two solutions to this problem:

1) Use the code as above and, in addition to the suggested environment variables, also disable active waiting:

$ MKL_DYNAMIC=FALSE MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 OMP_NESTED=TRUE OMP_WAIT_POLICY=passive ./exe

2) Modify the code to split the parallel regions:

#pragma omp parallel
{
    #pragma omp for nowait
    for(...) {}  // first loop

    #pragma omp for nowait
    for(...) {}  // second loop
}

dgemm_(...);

#pragma omp parallel
{
    #pragma omp for nowait
    for(...) {}  // third loop
}

For solution 1, the threads go to sleep immediately and do not consume cycles. The downside is that a thread then has to wake up from this deeper sleep state, which increases the latency compared to the spin-wait.

For solution 2, the threads are kept in their spin-wait loop and are very likely actively waiting when the dgemm call enters its parallel region. The additional joins and forks will also introduce some overhead, but it may be better than the over-subscription of the initial solution with the single construct or solution 1.

Which solution is best will clearly depend on the amount of work being done in the dgemm operation compared to the synchronization overhead for fork/join, which is mostly dominated by the thread count and the internal implementation.
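
To make solution 2 concrete, here is a minimal sketch of the split-region layout in C. It again avoids linking against MKL: gemm_stub is a hypothetical stand-in for dgemm_, and split_regions, the fill values and the final scaling loop are invented for illustration. The point is only the structure, where the multiply is called from serial code between two parallel regions so that no outer team is waiting on it.

```c
#include <assert.h>

/* Hypothetical stand-in for dgemm_: C = A * B for n x n row-major matrices.
   Called from serial code, so its parallel loop is a top-level region. */
static void gemm_stub(int n, const double *a, const double *b, double *c)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

static void split_regions(int n, double *a, double *b, double *c)
{
    assert(n > 0);

    /* First parallel region: two independent loops prepare the inputs. */
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < n * n; i++)   /* first loop */
            a[i] = 1.0;

        #pragma omp for nowait
        for (int i = 0; i < n * n; i++)   /* second loop */
            b[i] = 2.0;
    }   /* Join: every thread is done before the multiply starts. */

    /* Serial level: the library call is free to use all threads. */
    gemm_stub(n, a, b, c);

    /* Second parallel region: post-process the result. */
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < n * n; i++)   /* third loop */
            c[i] *= 0.5;
    }
}
```

Because the multiply is no longer nested, this variant should not need OMP_NESTED or MKL_DYNAMIC=FALSE. Whether the extra fork/join pays off is something to measure on your own problem sizes.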