Question

I read some CUDA documentation that refers to local memory (mostly the early documentation). The device properties report a local-memory size per thread. What does 'local' memory mean, where does it live, and how do I access it? Is it __device__ memory?

The device properties also report global, shared, and constant memory sizes. Are these statements correct? Global memory is __device__ memory; it has grid scope and the lifetime of the grid (kernel). Constant memory is __device__ __constant__ memory; it has grid scope and the lifetime of the grid (kernel). Shared memory is __device__ __shared__ memory; it has the scope of a single block and the lifetime of that block (of threads).
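For reference, here is a minimal sketch of the declarations I mean (the names are placeholders):

    __device__   float d_global[256];  // global memory: visible to all threads in all grids
    __constant__ float d_const[256];   // constant memory: read-only from device code

    __global__ void my_kernel(void)
    {
        __shared__ float s_tile[128];  // shared memory: one copy per block of threads
        // ...
    }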

I'm thinking shared memory is SM memory, i.e. memory that only a single SM has direct access to, and a rather limited resource. Isn't an SM assigned a bunch of blocks at a time? Does this mean an SM can interleave the execution of different blocks? That is: run block A's threads until they stall, then run block B's threads until they stall, then swap back to block A's threads again. Or does the SM run one set of block A's threads until they stall, swap in another set of block A's threads, and continue this until block A is exhausted, and only then begin work on block B?

I ask because of shared memory. If a single SM is swapping in code from two different blocks, how does it quickly swap the shared-memory chunks in and out? (I'm thinking the latter scenario is true and there is no swapping in/out of shared memory space: block A runs until completion, then block B starts execution. Note: block A could belong to a different kernel than block B.)


Solution

From the CUDA C Programming Guide section 5.3.2.2, we see that local memory is used in several circumstances:

  • When each thread has arrays whose size is not known at compile time (so they might not fit in the registers)
  • When the size of the arrays is known at compile time but is too big for register memory (this can also happen with big structs)
  • When the kernel has already used up all the register memory (so if we have filled the registers with n ints, the (n+1)th int will go into local memory); this last case is register spilling, and it should be avoided, because:

"Local" memory actually lives in the global memory space, which means reads and writes to it are comparatively slow compared to register and shared memory. You'll access local memory every time you use some variable, array, etc in the kernel that doesn't fit in the registers, isn't shared memory, and wasn't passed as global memory. You don't have to do anything explicit to use it - in fact you should try to minimize its use, since registers and shared memory are much faster.

Edit: Re: shared memory, you cannot have two blocks exchanging shared memory or looking at each other's shared memory. Since the order of execution of blocks is not guaranteed, if you tried to do this you might tie up an SM for hours waiting for another block to be executed. Similarly, two kernels running on the device at the same time can't see each other's memory UNLESS it is global memory, and even then you're playing with fire (race conditions). As far as I am aware, blocks/kernels can't really send "messages" to each other. Your scenario doesn't really make sense, since the order of execution of the blocks will be different every time and it's bad practice to stall a block waiting for another.
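To make the block scope concrete, here is a minimal sketch (assuming blocks of exactly 256 threads, a power of two; names are illustrative) of the standard pattern: threads within one block cooperate through __shared__ memory and __syncthreads(), while anything that crosses block boundaries has to go through global memory:

    __global__ void block_sum(const float *in, float *block_results)
    {
        // One copy of this array exists per block; no block can read
        // another block's copy.
        __shared__ float tile[256];

        int tid = threadIdx.x;
        tile[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();  // synchronizes threads within THIS block only

        // Tree reduction within the block, entirely in shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                tile[tid] += tile[tid + stride];
            __syncthreads();
        }

        // Results that other blocks (or the host) need must be written
        // to global memory.
        if (tid == 0)
            block_results[blockIdx.x] = tile[0];
    }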

Licensed under: CC-BY-SA with attribution