Your particular loop probably does not offer much opportunity for parallelism if you are using the default new operator: the heap is a single shared resource, and access to it has to be serialized through a mutex. However, assuming you have other loops for which you wish to use OpenMP, the following should help.
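One way to sidestep that allocator bottleneck is to touch the heap a fixed number of times instead of once per row. Here is a sketch; note that, unlike the loop in the question (which allocates rows of double pointers), this assumes a plain 2-D table of doubles, and the names make_table/free_table are my own:

```cpp
#include <cstddef>

// Sketch: build a row-pointer table backed by one contiguous block,
// so the heap is hit twice in total instead of once per row.
double **make_table(std::size_t row, std::size_t column)
{
    double *block = new double[row * column]();  // zero-initialized storage
    double **data = new double *[row];
    for (std::size_t i = 0; i < row; ++i)
        data[i] = block + i * column;            // row i points into the block
    return data;
}

void free_table(double **data)
{
    delete[] data[0];  // the contiguous block
    delete[] data;     // the row-pointer table
}
```

With the allocation done up front like this, there is no per-iteration new left to parallelize at all, which is often the better fix.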
From the OpenMP 3.1 specification:
static
  When schedule(static, chunk_size) is specified, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.
  When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. Note that the size of the chunks is unspecified in this case.

dynamic
  When schedule(dynamic, chunk_size) is specified, the iterations are distributed to threads in the team in chunks as the threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed.
  Each chunk contains chunk_size iterations, except for the last chunk to be distributed, which may have fewer iterations.
  When no chunk_size is specified, it defaults to 1.
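The round-robin rule for static can actually be checked, since it makes the iteration-to-thread mapping deterministic: with schedule(static, 4) and T threads, iteration i must land on thread (i / 4) % T. A small sketch (the function name is my own, and the fallback lets it also build without OpenMP):

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num() { return 0; }  // serial fallback
#endif

// Record which thread executes each iteration under schedule(static, 4).
// Per the spec, chunks of 4 are handed to threads round-robin in thread
// order, so with T threads iteration i belongs to thread (i / 4) % T.
std::vector<int> owners(std::size_t n)
{
    std::vector<int> who(n);
    #pragma omp parallel for schedule(static, 4)
    for (long i = 0; i < static_cast<long>(n); ++i)
        who[i] = omp_get_thread_num();
    return who;
}
```

No such guarantee exists for dynamic, where the mapping depends on which thread asks for work first.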
In your case, you are not specifying chunk_size, so with static the chunk size is left unspecified by the standard, and with dynamic it defaults to a single iteration per chunk.
In general, I prefer to have some control over the number of threads and over how many iterations each task executes. I found (on Windows, compiled with mingw-w64) that there is significant overhead each time a thread starts a new chunk of work, so it pays to hand out chunks that are as large as possible. What I tend to do is use dynamic (though for tasks with fixed execution time I could use static) and set chunk_size to the loop count divided by the number of threads. In your case, if you suspect uneven task execution times, you could divide that value by 2 or 4.
// At the top of a C++ file:
#include <omp.h>

static int NUM_THREADS = omp_get_num_procs();

// Then for your loop construct (I'm using a combined parallel for here).
// chunk_size must be positive, so guard against row being smaller than
// 2 * NUM_THREADS:
unsigned int chunk = row / NUM_THREADS / 2 > 0 ? row / NUM_THREADS / 2 : 1;
#pragma omp parallel for num_threads(NUM_THREADS) \
    schedule(dynamic, chunk)
for (unsigned int i = 0; i < row; ++i)
{
    data[i] = new double *[column]();
}
Note also that if you do not set num_threads, the default will be nthreads-var, which is determined from omp_get_max_threads().
Regarding the nowait clause: obviously make sure you are not using data outside your loop construct. Also, I'm using a combined parallel loop construct above, which means nowait cannot be specified on it.
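If you do need nowait, split the combined construct into a parallel region containing a separate for construct; nowait is allowed there, and it drops the implied barrier at the end of the loop. A sketch (the function and array names here are my own, and it is only safe because the two loops touch independent arrays):

```cpp
// Sketch: a `for` construct inside a `parallel` region may carry nowait,
// so threads that finish their chunks early can move straight on to the
// next work-shared loop instead of waiting at an implied barrier.
void fill(double *a, double *b, long n)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic, 64) nowait
        for (long i = 0; i < n; ++i)
            a[i] = static_cast<double>(i);

        // Safe only because this loop writes b and never reads a; with
        // nowait above, some threads may still be filling a[] here.
        #pragma omp for schedule(dynamic, 64)
        for (long i = 0; i < n; ++i)
            b[i] = 2.0 * i;
    }
}
```

Without -fopenmp the pragmas are simply ignored and the loops run serially, which makes this easy to test in both configurations.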