Question

I wonder if anyone could advise on storage of large (say 2000 x 2000 x 2000) 3D arrays for finite difference discretization computations. Does contiguous storage (float*) give better performance than float*** on modern CPU architectures? Here is a simplified example of the computations, which are done over entire arrays:

for (int i = 1; i < ni - 1; ++i)
    for (int j = 1; j < nj - 1; ++j)
        for (int k = 1; k < nk - 1; ++k)
            u[i][j][k] += v[i][j][k+1] + v[i][j][k-1]
                + v[i][j+1][k] + v[i][j-1][k]
                + v[i+1][j][k] + v[i-1][j][k];

vs.

    u[i * iStride + j * jStride + k] += ...

PS: Considering the size of the problems, the extra pointer storage in the T*** case is a very small overhead. Access is not random. Moreover, I do loop blocking to minimize cache misses. I am just wondering how triple dereferencing in the T*** case compares to index computation and a single dereference in the 1D-array case.

Solution

These are not apples-to-apples comparisons: a flat array is just that, a flat array, which your code partitions into segments according to some scheme for linearizing a rectangular 3D array. You access an element with a single dereference plus a handful of arithmetic operations.
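
For concreteness, here is a minimal sketch of the flat layout. The wrapper and the dimension names (ni, nj, nk) are illustrative, not from the question; iStride/jStride follow the question's naming:

#include <cstddef>
#include <vector>

// Minimal sketch of a contiguous 3D layout over a single allocation.
struct Flat3D {
    std::size_t ni, nj, nk;
    std::vector<float> data;
    Flat3D(std::size_t ni_, std::size_t nj_, std::size_t nk_)
        : ni(ni_), nj(nj_), nk(nk_), data(ni_ * nj_ * nk_) {}
    float& at(std::size_t i, std::size_t j, std::size_t k) {
        const std::size_t jStride = nk;       // elements per row
        const std::size_t iStride = nj * nk;  // elements per slab
        return data[i * iStride + j * jStride + k];  // one dereference
    }
};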

float***, on the other hand, lets you keep a "jagged" array of arrays of arrays, so the structure that you can represent inside such an array is a lot more flexible. Naturally, you pay for that flexibility with the additional CPU cycles required to dereference a pointer to pointer to pointer, then a pointer to pointer, and finally a pointer (the three pairs of square brackets in the code).

Naturally, access to the individual elements of a float*** is going to be a little slower if you access them in truly random order. However, if the order is not random, the difference you see may be tiny, because the pointer values will be cached.

float*** will also require more memory, because you need to allocate two additional levels of pointers.
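
To see where that overhead comes from, here is a minimal allocation sketch (alloc3d is a hypothetical helper; error handling and deallocation are omitted):

#include <cstddef>

// Allocates an ni x nj x nk jagged array. Levels 1 and 2 are the
// extra pointer storage mentioned above; level 3 holds the elements.
float*** alloc3d(std::size_t ni, std::size_t nj, std::size_t nk) {
    float*** a = new float**[ni];          // level 1: ni pointers
    for (std::size_t i = 0; i < ni; ++i) {
        a[i] = new float*[nj];             // level 2: nj pointers per slab
        for (std::size_t j = 0; j < nj; ++j)
            a[i][j] = new float[nk]();     // level 3: the element rows
    }
    return a;
}

For 2000^3 floats that is 2000 + 2000*2000 pointers, roughly 32 MB on a 64-bit machine against roughly 32 GB of element data, which matches the "very small overhead" noted in the question.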

OTHER TIPS

The short answer is: benchmark it. If the results are inconclusive, it means it doesn't matter. Do what makes your code most readable.

As @dasblinkenlight has pointed out, the structures are not equivalent, because T*** can be jagged.

At the most fundamental level, though, this comes down to the arithmetic and memory access operations.

For your 1D array, as you have already (almost) written, the calculation is:

 ptr = u + (i * iStride) + (j * jStride) + k
 read *ptr

With T***:

 ptr = u + i
 x = read *ptr
 ptr = x + j
 y = read *ptr
 ptr = y + k
 read *ptr

So you are trading two multiplications for two memory accesses.
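
Written out in C++, the pointer-chasing version is three dependent loads; each load must complete before the next address can even be formed (a sketch, with load3 as an illustrative name):

#include <cstddef>

// Step-by-step equivalent of u[i][j][k]: three dependent memory reads.
float load3(float*** u, std::size_t i, std::size_t j, std::size_t k) {
    float** slab = u[i];    // read 1: pointer to the i-th slab
    float*  row  = slab[j]; // read 2: pointer to the (i,j)-th row
    return row[k];          // read 3: the element itself
}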

In computer Go, where people are very performance-sensitive, everyone (AFAIK) uses T[361] rather than T[19][19] (*). This decision is based on benchmarking, both in isolation and in the whole program. (It is possible everyone did those benchmarks years and years ago and has never repeated them on the latest hardware, but my hunch is that a single 1D array will still be better.)

However, your array is huge in comparison. As the code involved is easy to write either way, I would definitely try both and benchmark.
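
A minimal timing harness for such a comparison might look like this (a sketch; stencil_flat and stencil_ptr are hypothetical functions implementing the two layouts):

#include <chrono>
#include <cstdio>

// Times a callable once and returns milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage sketch:
//   std::printf("flat: %.1f ms\n", time_ms([&] { stencil_flat(u, v); }));
//   std::printf("ptr:  %.1f ms\n", time_ms([&] { stencil_ptr(u3, v3); }));

Run several repetitions and discard the first, so you do not measure page faults on the freshly allocated arrays.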


*: Aside: I think it is actually T[21][21] vs. T[441] in most programs, as an extra row is added all around the board to speed up edge detection.

One issue that has not been mentioned yet is aliasing.

Does your compiler support some type of keyword like restrict to indicate that you have no aliasing? (It is not part of C++11, so it would have to be an extension.) If so, performance may be very close to the same. If not, there could be significant differences in some cases. The issue arises with something like:

for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
        a[j] = b[i];
    }
}

Can b[i] be loaded once per outer loop iteration and stored in a register for the entire inner loop? In the general case, only if the arrays don't overlap. How does the compiler know? It needs some type of restrict keyword.
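
As an example of the extension spelling: GCC and Clang accept __restrict__ (MSVC accepts __restrict); C99's restrict itself is not standard C++. A sketch:

// The qualifier promises the compiler that a and b never alias,
// so b[i] may legally be kept in a register across the inner loop.
void fill(float* __restrict__ a, const float* __restrict__ b, int n, int m) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[j] = b[i];  // without restrict, writing a[j] could change b[i]
}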

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow