Question

I've been working on C++ code for quantum chemistry, atomic and molecular tasks, which involves a lot of work with arrays (1D, 2D, 3D, etc.), and I have an entire class called array to handle this. Of course, from the very beginning the most fundamental member functions are those that dynamically allocate memory for those arrays, resize them, or delete them.

// data is a double ***; this loop sits inside an enclosing
// #pragma omp parallel region
data = new double **[row]();
#pragma omp for schedule(static) nowait
for(unsigned int i = 0; i < row; ++i)
{
    data[i] = new double *[column]();
}

Now what I am doing is speeding up these routines with OpenMP. For most of the routines I've been using the schedule(static) nowait clauses to divide my loops into chunks of roughly iterations/threads, since those chunks take almost the same time to be handled by their threads.

But for loops like the one above, with several calls to the new operator, I have the (bad) feeling that the chunks of these loops will not take the same time to execute on their threads, in the sense that I should consider applying schedule(dynamic, chunk_size) instead.

Do you agree? Dynamic allocation is not such a simple task and could be expensive, so chunks of dynamic allocations could differ in their execution time.

Actually, I am not sure whether I am making some kind of mistake concerning heap fragmentation or the like. Advice would be welcome.

PS: I am using the nowait clause to try to minimize the bottleneck of the implicit barriers.


Solution

Your particular loop probably does not offer much opportunity for parallelism if you are using the default new operator, since the heap is a single resource and access to it needs to be serialized through a mutex. However, assuming you have other loops for which you wish to use OpenMP, the following should help.

From the OpenMP 3.1 specification:

static: When schedule(static, chunk_size) is specified, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.

When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. Note that the size of the chunks is unspecified in this case.

dynamic: When schedule(dynamic, chunk_size) is specified, the iterations are distributed to threads in the team in chunks as the threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed.

Each chunk contains chunk_size iterations, except for the last chunk to be distributed, which may have fewer iterations.

When no chunk_size is specified, it defaults to 1.

In your case, you are not specifying chunk_size, so the number of iterations in each chunk is unspecified.
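
To make the difference concrete, a small test like the following (my own illustration, not part of your code) prints which thread picks up each iteration; swapping dynamic for static shows the round-robin assignment described above:

#include <cstdio>
#include <omp.h>

int main()
{
    #pragma omp parallel for schedule(dynamic, 4) num_threads(4)
    for (int i = 0; i < 32; ++i)
    {
        // With dynamic, each thread grabs a chunk of 4 iterations,
        // then requests another chunk when it finishes.
        std::printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}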

In general, I prefer to have some control over the number of threads and the number of iterations each thread executes per chunk. I found (on Windows, compiled with mingw-w64) that there is significant overhead when threads start a new chunk of work, so it is beneficial to give them as big a chunk as possible. What I tend to do is use dynamic (though I could use static for tasks with fixed execution time) and set the chunk_size to the loop count divided by the number of threads. In your case, if you suspect uneven task execution time, you could divide that further by 2 or 4.

#include <algorithm>
#include <omp.h>

// At the top of a C++ file:
static int NUM_THREADS = omp_get_num_procs();

// Then for your loop construct (I'm using a combined parallel for here).
// chunk_size must be positive, so guard against row / NUM_THREADS / 2
// evaluating to 0 when row is small:
#pragma omp parallel for num_threads(NUM_THREADS) \
   schedule(dynamic, std::max(1u, row / NUM_THREADS / 2))
for(unsigned int i = 0; i < row; ++i)
{
   data[i] = new double *[column]();
}

Note also that if you do not set num_threads, the default will be nthreads-var, which is determined from omp_get_max_threads.

Regarding the nowait clause, obviously make sure you are not using data after the loop construct before the threads have synchronized. I'm using a combined parallel loop construct above, which means nowait cannot be specified.
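
If you do want nowait, you have to split the construct into a separate parallel region containing the loop, roughly like this sketch (other_count and do_independent_work are hypothetical placeholders for a second loop that does not touch data):

#pragma omp parallel num_threads(NUM_THREADS)
{
    // Threads fall through this loop's end without waiting at a barrier...
    #pragma omp for schedule(dynamic, 1) nowait
    for (unsigned int i = 0; i < row; ++i)
    {
        data[i] = new double *[column]();
    }

    // ...and can start on independent work immediately. Nothing here may
    // depend on data[], or you need a barrier before this loop.
    #pragma omp for schedule(static)
    for (unsigned int i = 0; i < other_count; ++i)
    {
        do_independent_work(i);   // hypothetical placeholder
    }
}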

OTHER TIPS

There's a much cleaner way to do all of this:

#include <vector>

// One contiguous allocation for all elements...
std::vector<double> actual_data(row * column);

// ...plus one pointer per row into it.
std::vector<double *> data(row);
for (unsigned i = 0; i < row; i++) {
    data[i] = &actual_data[i * column];
}

You now have a single array, allocated in one go, plus an array of pointers into it. Inside parallel algorithms, you can use data[i][j] to reach the element you want with practically zero overhead (just one extra pointer indirection).

The only potential risk is false sharing, because adjacent rows of your matrix may share a cache line at their endpoints.

Memory management is automatic; no need to free anything.
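
If false sharing at row boundaries turns out to matter in practice, one mitigation is to round the per-row stride up to a multiple of the cache-line size. A rough sketch of that idea (assuming 64-byte cache lines, i.e. 8 doubles; row and column as above):

#include <cstddef>
#include <vector>

// Round the stride up so each row starts on its own cache line.
const std::size_t doubles_per_line = 64 / sizeof(double);
const std::size_t stride =
    ((column + doubles_per_line - 1) / doubles_per_line) * doubles_per_line;

std::vector<double> actual_data(row * stride);
std::vector<double *> data(row);
for (std::size_t i = 0; i < row; ++i)
    data[i] = &actual_data[i * stride];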

If you plan on doing a lot of Basic Linear Algebra Subprograms (BLAS) operations, then I advise you NOT to use arrays of arrays for multidimensional arrays. Unfortunately, the C syntax for multidimensional static and dynamic arrays has broken symmetry. This has led Java and C# to use the same syntax, so people often think that when they allocate a 2D array they are getting the same thing as a static array, but from the heap instead of the stack. Your 2D arrays are actually jagged arrays.

From my experience with BLAS operations (as well as image processing), you want contiguous chunks of memory for 2D arrays. So for an nxn matrix you should allocate it like double[n*n] and access it as data[i*n+j], or create a C++ matrix class where you can access the matrix like matrix.get(i,j).
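
A minimal version of such a class might look like this (my own sketch; the names are placeholders, and operator() is used instead of get for idiomatic C++):

#include <cstddef>
#include <vector>

// Minimal contiguous 2D matrix: one allocation, row-major storage.
class Matrix
{
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    double &operator()(std::size_t i, std::size_t j)       { return data_[i * cols_ + j]; }
    double  operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    double *raw() { return data_.data(); }   // handy for passing to BLAS routines

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};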

Don't allocate the memory in parallel. Most BLAS operations are going to be memory bound anyway, so OpenMP is only going to be useful for level-3 BLAS operations such as matrix multiplication, LU factorization, and Cholesky decomposition, which go as O(n^3).
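
For example, a naive matrix product over contiguous storage is exactly the kind of O(n^3) loop where OpenMP does pay off. A textbook sketch, not a tuned kernel:

// Naive C += A * B on flat row-major n x n arrays.
// C must be zero-initialized by the caller for a plain product.
void matmul(const double *A, const double *B, double *C, int n)
{
    // Rows of C are independent, so the outer loop parallelizes cleanly.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)      // k before j: streams through B row-wise
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}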

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow