Question

I'm working on a CUDA app that makes use of all available RAM on the card, and am trying to figure out different ways to reduce cache misses.

The problem domain consists of a large 2- or 3-D grid, depending on the type of problem being solved. (For those interested, it's an FDTD simulator). Each element depends on either two or four elements in "parallel" arrays (that is, another array of nearly identical dimensions), so the kernels must access either three or six different arrays.

The Problem

*Hopefully this isn't "too localized". Feel free to edit the question.*

The relationship between the three arrays can be visualized as follows (apologies for the mediocre ASCII art):

  A[0,0] -C[0,0]- A ---- C ---- A ---- C ---- A
    |             |             |             |
    |             |             |             |
  B[0,0]          B             B             B
    |             |             |             |
    |             |             |             |
    A ---- C ---- A ---- C ---- A ---- C ---- A
    |             |             |             |
    |             |             |             |
    B             B             B             B
    |             |             |             |
    |             |             |             |
    A ---- C ---- A ---- C ---- A ---- C ---- A
    |             |             |             |
    |             |             |             |
    B             B             B             B[3,2]
    |             |             |             |
    |             |             |             |
    A ---- C ---- A ---- C ---- A ---- C ---- A[3,3]
                                      [2,3]

Items connected by lines are coupled. As can be seen above, A[] depends on both B[] and C[], while B[] depends only on A[], as does C[]. All of A[] is updated in the first kernel, and all of B[] and C[] are updated in a second pass.

If I declare these arrays as simple 2D arrays, I wind up with strided memory access. For a very large domain size (the grid above is only about 3x3), this hurts occupancy and performance.
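To make that concrete, here is a minimal sketch of the two passes over flat row-major arrays. The kernel names, the nearest-neighbour stencil, and the missing FDTD coefficients are all placeholders rather than the actual update equations; the point is just that the y-direction neighbour sits a full row (`nx` floats) away:

```cuda
// Hypothetical two-pass update over flat row-major arrays (illustration only).
// Real FDTD coefficients and boundary handling are omitted.
__global__ void updateA(float *A, const float *B, const float *C, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= nx || y < 1 || y >= ny) return;

    int idx = y * nx + x;
    // A couples to C neighbours in x and B neighbours in y.
    // B[idx] and B[idx - nx] are nx floats apart: a large stride for big domains.
    A[idx] += (C[idx] - C[idx - 1]) - (B[idx] - B[idx - nx]);
}

__global__ void updateBC(const float *A, float *B, float *C, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx - 1 || y >= ny - 1) return;

    int idx = y * nx + x;
    // B and C each depend on only two A values.
    B[idx] += A[idx + nx] - A[idx];   // y-direction difference (strided)
    C[idx] += A[idx + 1]  - A[idx];   // x-direction difference (contiguous)
}
```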

So, I thought about rearranging the array layout in a Z-order curve:

[Image: Z-order space-filling curve]
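The usual way to compute a Z-order (Morton) index from 2D coordinates is to interleave the bits of x and y. A minimal sketch, assuming 16-bit coordinates (grids up to 65536 on a side); the helper names `part1By1` and `mortonIndex` are just illustrative:

```cuda
// Spread the lower 16 bits of v so they occupy the even bit positions
// (the classic "magic numbers" bit-twiddling version).
__host__ __device__ unsigned int part1By1(unsigned int v)
{
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Interleave x (even bits) and y (odd bits) into a 32-bit Morton index.
__host__ __device__ unsigned int mortonIndex(unsigned int x, unsigned int y)
{
    return part1By1(x) | (part1By1(y) << 1);
}
```

The decode goes the other way (compact every other bit), and the 3-D case interleaves three coordinates instead of two.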

Also, it would be fairly trivial to interleave these into one array, which should improve fetch performance, since (depending on the interleave order) at least half of the elements required for a given cell update would be close to one another. However, it's not clear to me whether the GPU uses multiple data pointers when accessing multiple arrays. If so, this imagined benefit could actually be a hindrance.
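For reference, the interleaving could look something like the sketch below: the three fields of cell `i` occupy three consecutive floats, so a single cache line usually covers most of what one update needs. The enum and helper names are placeholders, and a stride-of-three layout is only one option (a small struct, or padding to four floats for alignment, are others):

```cuda
// One interleaved buffer instead of three separate ones.
enum Field { FIELD_A = 0, FIELD_B = 1, FIELD_C = 2 };

__host__ __device__ inline int interleavedIndex(int cell, Field f)
{
    return cell * 3 + f;   // A, B, C of a cell are adjacent in memory
}

// Example reads inside a kernel body:
//   float a = grid[interleavedIndex(idx, FIELD_A)];
//   float b = grid[interleavedIndex(idx, FIELD_B)];
//   float c = grid[interleavedIndex(idx, FIELD_C)];
```

With this layout the kernel takes a single base pointer, which sidesteps the multiple-pointer question entirely, at the cost of 12-byte element groups that are no longer naturally aligned to `float4` loads.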

The Questions

I've read that NVidia does this automatically behind the scenes when using texture memory or a cudaArray. If this is not the case, should I expect the increased latency of crossing large spans (when the Z curve jumps from the upper right to the bottom left at a high subdivision level) to eliminate the benefit of the locality in the smaller grids?
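For reference, the cudaArray/texture path alluded to above looks roughly like this with the texture-object API (compute capability 3.0+); `hostField`, `nx`, and `ny` are placeholders. The cudaArray's internal layout is opaque and locality-optimized, though NVIDIA doesn't document it as literally being a Z-order curve:

```cuda
#include <cuda_runtime.h>

// Sketch: stage one nx-by-ny field in a cudaArray and read it through a
// texture object. The array's internal layout is chosen by the driver.
cudaTextureObject_t makeFieldTexture(const float *hostField, int nx, int ny)
{
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &fmt, nx, ny);
    cudaMemcpy2DToArray(arr, 0, 0, hostField, nx * sizeof(float),
                        nx * sizeof(float), ny, cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode     = cudaFilterModePoint;
    texDesc.readMode       = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}

// In a kernel: float a = tex2D<float>(texA, x + 0.5f, y + 0.5f);
```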

  1. Dividing the grid into smaller blocks that can fit in shared memory should certainly help, and the Z order makes this fairly trivial. Should I have a separate kernel pass that updates boundaries between blocks? Will the overhead of launching another kernel be significant compared to the savings I expect? (A shared-memory tiling sketch follows after this list.)

  2. Is there any real benefit to using a 2D vs. a 1D array? I expect memory to be linear, but am unsure if there is any real meaning to the 2D memory-layout metaphor that's often used in CUDA literature. (A pitched-allocation sketch also follows below.)
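On question 1: one common pattern loads each block's tile plus a one-cell halo into shared memory, so block boundaries are handled inside the same kernel rather than in a separate pass. A minimal sketch along the lines of the naive `updateA` above (again a placeholder stencil, `TILE` chosen arbitrarily, launched with `dim3(TILE, TILE)` blocks):

```cuda
#define TILE 16

// Hypothetical tiled A-update: each block stages a TILE x TILE patch of B and C
// (plus a one-cell halo) in shared memory, so block-boundary cells are covered
// without a second kernel launch.
__global__ void updateA_tiled(float *A, const float *B, const float *C, int nx, int ny)
{
    __shared__ float sB[TILE + 1][TILE];
    __shared__ float sC[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    bool inside = (x < nx && y < ny);
    int idx = y * nx + x;

    // Stage this block's B and C values, shifted by one to leave room for a halo.
    if (inside) {
        sB[threadIdx.y + 1][threadIdx.x] = B[idx];
        sC[threadIdx.y][threadIdx.x + 1] = C[idx];
    }
    // Halo: the first row/column of the tile also fetches the neighbour
    // just outside the block.
    if (inside && threadIdx.y == 0 && y > 0) sB[0][threadIdx.x] = B[idx - nx];
    if (inside && threadIdx.x == 0 && x > 0) sC[threadIdx.y][0] = C[idx - 1];

    __syncthreads();  // every thread reaches this, including out-of-range ones

    if (inside && x > 0 && y > 0)
        A[idx] += (sC[threadIdx.y][threadIdx.x + 1] - sC[threadIdx.y][threadIdx.x])
                - (sB[threadIdx.y + 1][threadIdx.x] - sB[threadIdx.y][threadIdx.x]);
}
```

Kernel launch overhead is typically a few microseconds, so a separate boundary pass only costs noticeably when the kernels themselves are very short; the bigger price is usually the extra global reads the boundary pass repeats.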
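On question 2: device memory is linear either way; the "2D" layout in the CUDA literature mostly refers to pitched allocations, where each row is padded so that row starts land on aligned addresses. A short sketch of the two styles (sizes and names are placeholders):

```cuda
#include <cuda_runtime.h>

// Sketch: the two allocation styles for an nx-by-ny float field.
void allocateExamples(int nx, int ny)
{
    // Flat 1D allocation: you do the row-major index arithmetic yourself.
    float *flat = NULL;
    cudaMalloc(&flat, (size_t)nx * ny * sizeof(float));
    // element (x, y) lives at flat[y * nx + x]

    // "2D" (pitched) allocation: still linear memory, but each row is padded
    // to an alignment-friendly pitch chosen by the driver.
    float *pitched = NULL;
    size_t pitch = 0;  // in bytes, usually >= nx * sizeof(float)
    cudaMallocPitch(&pitched, &pitch, (size_t)nx * sizeof(float), ny);
    // element (x, y) lives at ((float *)((char *)pitched + y * pitch))[x]

    cudaFree(flat);
    cudaFree(pitched);
}
```

The other place the 2D metaphor has real meaning is cudaArray/texture memory, whose layout genuinely differs from plain row-major.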

Wow - long question. Thanks for reading and answering any/all of this.


Solution

Just to get this off of the unanswered list:

After a lot of benchmarking and playing with different arrangements, the fastest approach I found was to keep the arrays interleaved in Z-order, so that most of the values required by a thread were located near each other in RAM. This improved cache behavior (and thus performance). Obviously there are many cases where the Z order fails to keep required values close together. I wonder if rotating quadrants to reduce the "distance" between the end of one Z and the start of the next quadrant would help, but I haven't tried it.
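For completeness, the addressing this describes could look roughly like the sketch below, reusing a Morton helper like the `mortonIndex` sketched in the question; the field count of three and the helper names are assumptions:

```cuda
// Interleaved, Z-ordered addressing: the three fields of a cell sit next to
// each other, and cells are laid out along the Morton curve.
__host__ __device__ inline unsigned int cellOffset(unsigned int x, unsigned int y)
{
    return mortonIndex(x, y) * 3;   // 3 interleaved fields per cell
}

// In a kernel:
//   unsigned int base = cellOffset(x, y);
//   float a = grid[base + 0];
//   float b = grid[base + 1];
//   float c = grid[base + 2];
```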

Thanks to everyone for the advice.
