Solving collisions - try to coalesce gmem access, using smem, but banks conflicts

Question

Disclaimer

Note that this answer contains more questions than answers. Also note that I'm guessing a lot because I don't get huge parts of your question and source code.

Reconstruction

So I'm guessing that your global memory is an array of Circle structs. You seem to have optimized loading these circles by loading each of their floats separately into shared memory. This way you get continuous access patterns instead of strided ones. Am I still correct here?

So now that you have loaded blockDim.x circles into shared memory cooperatively you want to read a circle c from it for each thread, You seem to have tried 3 different ways:

loading c from strided shared memory
(c.prevX = smem[threadIdx.x + blockDim.x * 2];, etc.)
loading c directly from shared memory
(c = *((Circle*)(smem + threadIdx * SMEM));)
loading c directly from global memory
(c = cOut[j];)

Still correct?

Evaluation

doesn't make any sense when you load circles into shared memory like the way I described before. So you probably have tried a different loading pattern there. Something along the lines of [threadId.x * 8 + 0] as noted in your comment. This solution has the benefit of continuous global access but storing into smem using ank conflicts.
is no better because it has bank conflict when reading into registers.
is worse because of strided global memory access.

Answer

Bank conflicts are easily resolved by inserting dummy values. Instead of using [threadId.x * 8 + 0] you would use [threadId.x * 9 + 0]. Note that you are wasting a bit of shared memory (i.e every ninth float) to spread out the data across banks. Note that you have to do the same when loading the data into shared memory in the first place. But notice that you are still doing a lot of work to just load these Circle structs there. Which leads me to an

Even better answer

Just don't use an array of Circle structs in global memory. Invert your memory pattern by using multiple arrays of float instead. One for each component of a Circle. You can then simply load into registers directly.

c.x = gmem_x[j];
c.y = gmem_y[j];
...

No more shared memory at all, less registers due to less pointer calculation, continuous global access patterns, no bank conflicts. All of it for free!

Now you might think there's a downside to it when preparing the data on the host and getting the results back. My best (and final) guess is that it will still be much faster overall because you'll probably either launch the kernel every frame and visualize with a shader without ever transferring the data back to the host or launch the kernel multiple times in a row before downloading the results. Correct?