This is a case of nested parallelism. It is supported by MKL, but it only works if your executable is built using the Intel C/C++ compiler. The reason for that restriction is that MKL uses Intel's OpenMP runtime and that different OMP runtimes do not play well with each other.
Once that is sorted out, you should enable nested parallelism by setting OMP_NESTED to TRUE and disable MKL's detection of nested parallelism by setting MKL_DYNAMIC to FALSE. If the data to be processed with dgemm_ is shared, then you have to invoke the latter from within a single construct. If each thread processes its own private data, then you don't need any synchronisation constructs, but using multithreaded MKL won't give you any benefit either. Therefore I would assume that your case is the former.
To summarise:
#pragma omp single
dgemm_(...);
and run with:
$ MKL_DYNAMIC=FALSE MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 OMP_NESTED=TRUE ./exe
You could also set the parameters with the appropriate calls:
mkl_set_dynamic(0);
mkl_set_num_threads(8);
omp_set_nested(1);
#pragma omp parallel num_threads(8) ...
{
...
}
though I would prefer to use environment variables instead.
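For illustration, here is a minimal sketch of the shared-data case. It does not link against MKL: matmul_standin is a hypothetical stand-in for dgemm_ whose own parallel region plays the role of MKL's internal threading, and shared_product_demo is likewise a name made up for this example. I use omp_set_max_active_levels(2) here as the newer way to allow one level of nesting; treat that as an assumption about your OpenMP runtime (it needs OpenMP 3.0 or later).

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for dgemm_: C = A * B for n x n row-major matrices.
   Its own parallel region mimics the threaded region inside MKL. */
static void matmul_standin(const double *a, const double *b, double *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Fills the shared matrices in parallel, then multiplies them exactly once.
   Returns c[0] so the caller can check the result. */
double shared_product_demo(int n, double *a, double *b, double *c)
{
    int total = n * n;

#ifdef _OPENMP
    /* Allow one nested level, similar in effect to OMP_NESTED=TRUE. */
    omp_set_max_active_levels(2);
#endif

    #pragma omp parallel
    {
        /* All threads cooperate on preparing the shared data. */
        #pragma omp for
        for (int i = 0; i < total; i++) {
            a[i] = 1.0;
            b[i] = 2.0;
        }
        /* The implicit barrier after the loop guarantees the data is ready. */

        /* Only one thread makes the call; the others wait at the implicit
           barrier at the end of the single construct. */
        #pragma omp single
        matmul_standin(a, b, c, n);
    }
    return c[0];
}
```

Call it from main with three n*n arrays and build with your compiler's OpenMP switch (for example -fopenmp with GCC or -qopenmp with the Intel compiler). Without that switch the pragmas are ignored and the code simply runs serially.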