Question

I am trying to figure out how well the global memory write accesses of one of my kernels are coalesced, based on the "global store efficiency" value of NVIDIA's profiler (I am using the CUDA 5 toolkit preview release, on a Fermi GPU).

As far as I understand, this value is the ratio of the number of requested memory transactions to the number of transactions actually performed, and therefore reflects whether accesses are all perfectly coalesced (100% efficiency) or not.
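
In other words (this is just my understanding of the metric, not something I have verified against NVIDIA's documentation):

global store efficiency = requested store transactions / store transactions actually performed

So, for example, a warp storing 32 consecutive, 128-byte-aligned floats should be served by a single 128-byte transaction for the single request (100%), whereas the same store shifted by one element would straddle two 128-byte segments, need two transactions, and score only 50%.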

Now, for a thread block width of 32, and taking float values as input and output, the following test kernel gives 100% efficiency for both the global load and the global store, as expected:

__global__ void dummyKernel(float* output, float* input, size_t pitch)
{
  // One thread per element of a 2D array; pitch is given in elements, not bytes
  unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
  int offset = y * pitch + x;
  float tmp = input[offset];   // consecutive threads read consecutive floats (coalesced)
  output[offset] = tmp;        // same access pattern for the store (coalesced)
}
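
A minimal host-side setup for this kernel would look something like the sketch below (the array sizes and block shape are placeholders, not my actual values; note that the pitch is passed in elements rather than bytes, because the kernel indexes a float* with y*pitch+x):

int width = 1024, height = 1024;          // placeholder image size
size_t pitchBytes = 0;
float *d_in = 0, *d_out = 0;
cudaMallocPitch((void**)&d_in,  &pitchBytes, width * sizeof(float), height);
cudaMallocPitch((void**)&d_out, &pitchBytes, width * sizeof(float), height);

dim3 block(32, 8);                        // thread block width of 32, as above
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
dummyKernel<<<grid, block>>>(d_out, d_in, pitchBytes / sizeof(float));
cudaDeviceSynchronize();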

What I don't understand is why, when I start adding useful code between the input read and the output write, the global store efficiency begins to drop, even though I have not changed the memory write pattern or the thread block geometry. The global load efficiency stays at 100%, as I expect.

Could someone please shed some light on why this happens? I thought that, since all 32 threads in a given warp execute the output store instruction simultaneously (by definition) and use a "coalescing-friendly" pattern, I should still get 100% whatever I do before, but obviously I must be misunderstanding something about either the meaning of global store efficiency or the conditions for global store coalescing.

Thx,

EDIT:

Here is an example: if I use this code (just adding a "round" operation on the input), the global store efficiency drops from 100% to 95%.

__global__ void dummyKernel(float* output, float* input, size_t pitch)
{
  unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
  int offset = y * pitch + x;
  float tmp = round(input[offset]);   // only change: round the loaded value
  output[offset] = tmp;
}

Solution 2

OK, shame on me, I found the problem: I was profiling this simple test code in Debug mode, which gives completely wild numbers for most metrics. Re-profiling in Release mode gave me the expected result: 100% store efficiency in both cases.
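
(Presumably the Debug configuration passes nvcc's -G flag, which compiles device code without optimizations and by itself can distort the profiler's efficiency metrics. Assuming a plain command-line build rather than my actual project settings, the two cases would look roughly like this:)

nvcc -G -g -arch=sm_20 dummyKernel.cu -o test_debug      # device-debug build: unoptimized device code
nvcc -O2 -arch=sm_20 dummyKernel.cu -o test_release      # release-style build: device code optimized as usual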

OTHER TIPS

I am unsure whether this is the case, but round probably converts its argument to a double, and if there is register spilling, each thread would access 8 bytes of memory, which would then be coerced into the 4 bytes of tmp. Accessing 8 bytes per thread would reduce the coalescing to a half-warp.

However, I believe register spilling shouldn't happen here, since the number of local variables in your kernel is small. You can check for spills by compiling with nvcc --ptxas-options=-v.
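
If the implicit float-to-double promotion from round is the concern, one quick experiment (just a suggestion, I have not checked whether it changes the reported metric) is to use the single-precision variant roundf, which keeps the value in 32-bit registers:

__global__ void dummyKernelF(float* output, float* input, size_t pitch)
{
  unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
  int offset = y * pitch + x;
  float tmp = roundf(input[offset]);   // single-precision round, no double intermediate
  output[offset] = tmp;
}

If round-to-nearest-even semantics are acceptable, rintf is another option and is typically cheaper than roundf on the GPU.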

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow