Question

I am executing a 320*320 array multiplication using CUDA on a GPU, and I have observed that a fixed amount of memory is consumed which is unaccounted for. For example, in a 640*640 array multiplication, with each element occupying 4 bytes and three such arrays in the code, approximately 5 MB of GPU memory should be consumed. But when I check with the nvidia-smi command, it shows 53 MB as consumed, so roughly 48 MB is unaccounted for. The same is true for 1200*1200 or any other size.
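
One way to see where the extra memory lives is to measure from inside the program: the delta reported by cudaMemGetInfo covers only the explicit cudaMalloc calls (plus allocation rounding), while nvidia-smi also counts the CUDA context and driver buffers. A minimal sketch using the question's 640*640 example (array names are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n = 640;
    const size_t bytes = n * n * sizeof(float);  // ~1.6 MB per array

    cudaFree(0);  // force context creation before the baseline measurement
    size_t freeBefore, freeAfter, total;
    cudaMemGetInfo(&freeBefore, &total);

    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);

    cudaMemGetInfo(&freeAfter, &total);
    printf("explicit allocations: %zu bytes\n", 3 * bytes);
    printf("memory consumed:      %zu bytes\n", freeBefore - freeAfter);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```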

Solution

The CUDA driver maintains numerous device memory allocations, including but not limited to:

  1. Local Memory
    • Size = (user-specified lmem size per thread + driver-specified syscall stack) * MultiprocessorCount * MaxThreadsPerMultiprocessor (see the sketch after this list).
    • Example - 15 SM GK110
      • 15 Multiprocessors
      • 2048 MaxThreadsPerMultiprocessor
      • 2048 bytes per thread (cudaLimitStackSize)
      • 512 bytes per thread for syscall stack
      • Size = 15 * 2048 * (2048 + 512) = 78,643,200 bytes (75 MB)
  2. Printf FIFO
  3. Malloc Heap
  4. Constant Buffers
    • Driver allocates multiple constant buffers per stream. These are used to pass launch configuration and launch parameters, module constants, and constant variables. The PTX manual has additional information on constant buffers.
  5. CUDA Dynamic Parallelism Buffers
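
Item 1's formula can be evaluated at runtime from the device properties. A minimal sketch, assuming the formula above and reusing the 512-byte syscall-stack figure from the GK110 example (the actual driver value is not queryable and may differ per architecture):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    size_t stackSize = 0;  // current cudaLimitStackSize, per thread
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);

    const size_t syscallStack = 512;  // assumption: driver-specified value from the example
    size_t lmem = (stackSize + syscallStack)
                * (size_t)prop.multiProcessorCount
                * (size_t)prop.maxThreadsPerMultiProcessor;

    printf("Estimated local memory reservation: %zu bytes\n", lmem);
    return 0;
}
```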

The driver defers creation of these buffers until necessary. This often means that the memory allocation will be done in one of the API calls to launch a kernel.

Items 1, 2, and 3 can be controlled to some extent through cudaDeviceSetLimit.
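
A minimal sketch of that control, with arbitrary illustrative values rather than recommendations (the printf FIFO and heap limits must be set before the first kernel launch that would create those buffers):

```cuda
#include <cuda_runtime.h>

int main() {
    cudaDeviceSetLimit(cudaLimitStackSize,      1024);     // item 1: per-thread stack
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 << 20);  // item 2: printf FIFO, 1 MB
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 4 << 20);  // item 3: device heap, 4 MB
    // ... kernel launches; the driver sizes its deferred buffers from these limits
    return 0;
}
```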

Item 4 grows linearly with the number of streams allocated and modules loaded. At an architecture-dependent point, the driver will start aliasing stream constant buffers to limit the resource allocations.
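
This growth can be observed directly. A minimal sketch that creates streams one at a time, forces the deferred per-stream buffers into existence with a trivial launch, and samples free memory after each step (the stream count and kernel are arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    cudaFree(0);  // force context creation so it is excluded from the deltas
    size_t freeMem, total;
    cudaMemGetInfo(&freeMem, &total);
    printf("baseline free=%zu\n", freeMem);

    for (int i = 0; i < 8; ++i) {
        cudaStream_t s;
        cudaStreamCreate(&s);
        noop<<<1, 1, 0, s>>>();  // first launch allocates the stream's buffers
        cudaStreamSynchronize(s);
        cudaMemGetInfo(&freeMem, &total);
        printf("streams=%d free=%zu\n", i + 1, freeMem);
        // streams are intentionally not destroyed so the allocations persist
    }
    return 0;
}
```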
