Is incomplete global memory access coalesced?

Question

If threadId is computed correctly as documented in cuda programming guide - thread hierachy, than this access will be coalesced - this will be the case for threadId = threadIdx.x.

For the different compute architectures memory coalescing differs slightly. More details can be found at appendix G of cuda programming guide.

In general you can say that global memory accesses are coalesced, if your threads grab consecutive elements in the memory, starting from the address of the element that your first thread access.

Let's assume you have a float array.
float array[]
and your memory acces looks in that way

array[threadIdx.x == 0, threadId.x == 1, threadIdx.x == 2, ..., threadIdx.x == 31]

than your access will be coalesed.

But if you access memory in that way (interleaved for example)

array[threadIdx.x == 0, NONE, threadId.x == 1, NONE, threadIdx.x == 2, ..., NONE,  threadIdx.x == 31]

than your access isn't coalesced (NONE means that this array element isn't access by any thread)

In the first case you grab 128 consecutive bytes memory. In the second case you grab 256 bytes. For the second case two warps are required to load the memory from global memory instead of one warp for the first case. But in both cases only 32 float elements (that are 128 bytes) are required for the following computations. So your global load-rate will drop from 1.0 to 0.5 in that simple case.