Question

I am using OpenMP and MPI to parallelize some matrix operations in c. Some of the functions operating on the matrix are written in Fortran. The Fortran functions require a buffer array to be passed in which is only used internally in the function. Currently I am allocating buffers in each parallel section similar to the code below.

int i = 0;
int n = 1024; // Actually this is read from command line
double **a = createNbyNMat(n);
#pragma omp parallel
{
    double *buf;
    buf = malloc(sizeof(double)*n);
#pragma omp for
    for (i=0; i < n; i++)
    {
        fortranFunc1_(a[i], &n, buf);
    }
    free(z);
}

// Serial code and moving data around in the matrix a using MPI

#pragma omp parallel
{
    double *buf;
    buf = malloc(sizeof(double)*n);
#pragma omp for
    for (i=0; i < n; i++)
    {
        fortranFunc2_(a[i], &n, buf);
    }
    free(z);
}

// and repeat a few more times.

I know reallocating the buffers can be avoided using a method similar to the code below, but I was curious if there is an easier way or some built in functionality in OpenMP for handling this. It would be nice to be able to compile the code without a lot of compiler directives whether or not OpenMP is present on the system we are compiling for.

double **buf;
buf = malloc(sizeof(double*) * num_openmp_threads);
int i = 0;
for (i = 0; i < num_openmp_threads; ++i)
{
    buf[i] = malloc(sizeof(double) * n);
}

// skip ahead

#pragma omp for
for (i=0; i < n; i++)
{
    fortranFunc1_(a[i], &n, buf[current_thread_num]);
}
Was it helpful?

Solution

It is possible to do it using thread-private variables. Those persist across subsequent parallel regions:

void func(...)
{
   static double *buf;
   #pragma omp threadprivate(buf)

   #pragma omp parallel num_threads(nth)
   {
       buf = malloc(n * sizeof(double));
       ...
   }

   #pragma omp parallel num_threads(nth)
   {
       // Access buf here - it is still allocated
   }

   #pragma omp parallel num_threads(nth)
   {
       // Free the memory in the last parallel region
       free(buf);
   }
}

There are several key points to notice here. First, the number of threads that allocate buf should match the number of threads that deallocate it. Also, if there are parallel regions in between and they execute with larger teams, buf might not be allocated in all of them. Therefore it is advisable to either disable the dynamic team size feature of OpenMP or to simply use the num_threads clause as shown above to fix the number of threads for each parallel region.

Second, local variables can be made thread-private only if they are static. Therefore, this method is not suitable for use in recursive functions.

The code should compile and work as expected even if OpenMP support is disabled.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top