Question

I've been working on C++ code for quantum chemistry, atomic and molecular tasks, which involves a lot of work with arrays (1D, 2D, 3D, etc.), and I have an entire class called array to handle this. Of course, from the very beginning the most fundamental member functions are those that dynamically allocate memory for those arrays, resize them, or delete them.

// data is a double ***; this loop sits inside an enclosing
// #pragma omp parallel region
data = new double **[row]();
#pragma omp for schedule(static) nowait
for(unsigned int i = 0; i < row; ++i)
{
    data[i] = new double *[column]();
}

Now what I am doing is speeding up these routines with OpenMP. For most of the routines I've been using the schedule(static) nowait clauses to divide my loops into chunks of roughly iterations/threads, since those chunks take almost the same time to be handled by their threads.

But for loops like the one above, with several calls to the new operator, I have the (bad) feeling that the chunks of these loops will not take the same time to execute on their threads, in the sense that I should consider applying schedule(dynamic, chunk_size) instead.

Do you agree? Dynamic allocation is not such a simple task and could be expensive, so chunks of dynamic allocations could differ in their execution time.

Actually, I am not sure whether I am making some kind of mistake concerning heap fragmentation or the like. Advice would be welcome.

PS: I am using the nowait clause to try to minimize the bottleneck of the implicit barriers.


Solution

Your particular loop probably does not offer much opportunity for parallelism if you are using the default new operator, since the heap is a single resource and access to it needs to be serialized through a mutex. However, assuming you have other loops for which you wish to use OpenMP, the following should help.

From the OpenMP 3.1 specification:

static: When schedule(static, chunk_size) is specified, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.

When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. Note that the size of the chunks is unspecified in this case.

dynamic: When schedule(dynamic, chunk_size) is specified, the iterations are distributed to threads in the team in chunks as the threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed.

Each chunk contains chunk_size iterations, except for the last chunk to be distributed, which may have fewer iterations.

When no chunk_size is specified, it defaults to 1.

In your case, you are not specifying chunk_size, so the number of iterations in each chunk is unspecified.
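
To make the difference concrete, a small test like the following (my own illustration, not part of your code) prints which thread picks up each iteration; swapping dynamic for static shows the round-robin assignment described above:

#include <cstdio>
#include <omp.h>

int main()
{
    #pragma omp parallel for schedule(dynamic, 4) num_threads(4)
    for (int i = 0; i < 32; ++i)
    {
        // With dynamic, each thread grabs a chunk of 4 iterations,
        // then requests another chunk when it finishes.
        std::printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}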

In general, I prefer to have some control over the number of threads and the number of iterations each thread executes per chunk. I found (on Windows, compiled with mingw-w64) that there is significant overhead when threads start a new chunk of work, so it is beneficial to give them as big a chunk as possible. What I tend to do is use dynamic (though I could use static for tasks with fixed execution time) and set the chunk_size to the loop count divided by the number of threads. In your case, if you suspect uneven task execution time, you could divide that further by 2 or 4.

#include <algorithm>
#include <omp.h>

// At the top of a C++ file:
static int NUM_THREADS = omp_get_num_procs();

// Then for your loop construct (I'm using a combined parallel for here).
// chunk_size must be positive, so guard against row / NUM_THREADS / 2
// evaluating to 0 when row is small:
#pragma omp parallel for num_threads(NUM_THREADS) \
   schedule(dynamic, std::max(1u, row / NUM_THREADS / 2))
for(unsigned int i = 0; i < row; ++i)
{
   data[i] = new double *[column]();
}

Note also that if you do not set num_threads, the default will be nthreads-var, which is determined from omp_get_max_threads.

Regarding the nowait clause, obviously make sure you are not using data after the loop construct before the threads have synchronized. I'm using a combined parallel loop construct above, which means nowait cannot be specified.
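
If you do want nowait, you have to split the construct into a separate parallel region containing the loop, roughly like this sketch (other_count and do_independent_work are hypothetical placeholders for a second loop that does not touch data):

#pragma omp parallel num_threads(NUM_THREADS)
{
    // Threads fall through this loop's end without waiting at a barrier...
    #pragma omp for schedule(dynamic, 1) nowait
    for (unsigned int i = 0; i < row; ++i)
    {
        data[i] = new double *[column]();
    }

    // ...and can start on independent work immediately. Nothing here may
    // depend on data[], or you need a barrier before this loop.
    #pragma omp for schedule(static)
    for (unsigned int i = 0; i < other_count; ++i)
    {
        do_independent_work(i);   // hypothetical placeholder
    }
}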

OTHER TIPS

There's a much cleaner way to do all of this:

#include <vector>

// One contiguous allocation for all elements...
std::vector<double> actual_data(row * column);

// ...plus one pointer per row into it.
std::vector<double *> data(row);
for (unsigned i = 0; i < row; i++) {
    data[i] = &actual_data[i * column];
}

You now have a single array, allocated in one go, plus an array of pointers into it. Inside parallel algorithms, you can use data[i][j] to reach the element you want with practically zero overhead (just one extra pointer indirection).

The only potential risk is false sharing, because adjacent rows of your matrix may share a cache line at their endpoints.

Memory management is automatic; no need to free anything.
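
If false sharing at row boundaries turns out to matter in practice, one mitigation is to round the per-row stride up to a multiple of the cache-line size. A rough sketch of that idea (assuming 64-byte cache lines, i.e. 8 doubles; row and column as above):

#include <cstddef>
#include <vector>

// Round the stride up so each row starts on its own cache line.
const std::size_t doubles_per_line = 64 / sizeof(double);
const std::size_t stride =
    ((column + doubles_per_line - 1) / doubles_per_line) * doubles_per_line;

std::vector<double> actual_data(row * stride);
std::vector<double *> data(row);
for (std::size_t i = 0; i < row; ++i)
    data[i] = &actual_data[i * stride];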

If you plan on doing a lot of Basic Linear Algebra Subprograms (BLAS) operations, then I advise you NOT to use arrays of arrays for multidimensional arrays. Unfortunately, the C syntax for multidimensional static and dynamic arrays has broken symmetry. This has led Java and C# to use the same syntax, so people often think that when they allocate a 2D array they are getting the same thing as a static array, but from the heap instead of the stack. Your 2D arrays are actually jagged arrays.

From my experience with BLAS operations (as well as image processing), you want contiguous chunks of memory for 2D arrays. So for an nxn matrix you should allocate it like double[n*n] and access it as data[i*n+j], or create a C++ matrix class where you can access the matrix like matrix.get(i,j).
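
A minimal version of such a class might look like this (my own sketch; the names are placeholders, and operator() is used instead of get for idiomatic C++):

#include <cstddef>
#include <vector>

// Minimal contiguous 2D matrix: one allocation, row-major storage.
class Matrix
{
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    double &operator()(std::size_t i, std::size_t j)       { return data_[i * cols_ + j]; }
    double  operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    double *raw() { return data_.data(); }   // handy for passing to BLAS routines

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};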

Don't allocate the memory in parallel. Most BLAS operations are going to be memory bound anyway, so OpenMP is only going to be useful for level-3 BLAS operations such as matrix multiplication, LU factorization, and Cholesky decomposition, which go as O(n^3).
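
For example, a naive matrix product over contiguous storage is exactly the kind of O(n^3) loop where OpenMP does pay off. A textbook sketch, not a tuned kernel:

// Naive C += A * B on flat row-major n x n arrays.
// C must be zero-initialized by the caller for a plain product.
void matmul(const double *A, const double *B, double *C, int n)
{
    // Rows of C are independent, so the outer loop parallelizes cleanly.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)      // k before j: streams through B row-wise
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}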

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow