Question

Suppose I declare a local variable in a CUDA kernel function for each thread:

float f = ...; // some calculations here

Suppose also that the compiler placed the declared variable in local memory (which, as far as I know, resides in the same off-chip memory as global memory but is visible to a single thread only). My question is: will accesses to f be coalesced when it is read?
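For concreteness, here is a minimal sketch of the scenario (the kernel name and parameters are my own invention). A scalar like f usually stays in a register; a dynamically indexed local array is the typical case the compiler actually places in local memory:

__global__ void scale(const float *in, float *out, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float f = in[i] * in[i];  // some calculations here; a scalar like this
                              // usually ends up in a register
    float buf[32];            // indexed with a runtime value below, so the
                              // compiler will likely place it in local memory
    for (int j = 0; j < 32; ++j)
        buf[j] = f * j;
    out[i] = buf[k & 31];     // runtime index forces addressable (local) storage
}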


Solution

I don't believe there is official documentation of how local memory (or the stack on Fermi) is laid out in memory, but I am pretty certain that per-multiprocessor allocations are accessed in a "striped" (interleaved) fashion, so that non-diverging threads in the same warp get coalesced access to local memory. On Fermi, local memory is also cached through the same L1/L2 hierarchy as global memory.
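To see why an interleaved layout coalesces, consider a hypothetical address model (an illustration only, not documented behavior): word w of lane t lives at base + (w * 32 + t) * sizeof(float), so the per-thread copies of the same variable sit at consecutive addresses across the warp:

// Hypothetical interleaved ("striped") local-memory layout.
// When all 32 non-diverging lanes of a warp read the same word w,
// the warp touches 32 consecutive 4-byte addresses -- the same
// access pattern that global-memory coalescing rules reward.
__device__ float load_local(const float *base, int word)
{
    int lane = threadIdx.x & 31;    // lane index within the warp (1-D block assumed)
    return base[word * 32 + lane];  // consecutive addresses across lanes
}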

OTHER TIPS

CUDA GPUs keep local variables in registers whenever possible; a variable spills to local memory only when the kernel runs out of registers or when it is an array indexed with a runtime value. Complex kernels with many variables increase register pressure and reduce the number of threads that can run concurrently, a condition known as low occupancy.
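One way to check where your variables actually ended up is to ask ptxas for a per-kernel resource summary:

nvcc -Xptxas=-v kernel.cu

This prints the register count and, if anything was demoted to local memory, lines reporting spill stores and spill loads in bytes (the exact wording varies by toolkit version). Zero spill bytes means all of the kernel's scalars stayed in registers.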
