Question

I am executing a 320*320 array multiplication using CUDA on a GPU, and I have observed that a fixed amount of memory is consumed which is unaccounted for. For example, in a 640*640 array multiplication, with each element occupying 4 bytes and three such arrays in the code, approximately 5 MB of GPU memory should be consumed. But when I check with the nvidia-smi command, it shows 53 MB as consumed, so roughly 48 MB is unaccounted for. The same is true for 1200*1200 or any other size.
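
One way to see where the extra memory lives is to measure from inside the program: the delta reported by cudaMemGetInfo covers only the explicit cudaMalloc calls (plus allocation rounding), while nvidia-smi also counts the CUDA context and driver buffers. A minimal sketch using the question's 640*640 example (array names are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n = 640;
    const size_t bytes = n * n * sizeof(float);  // ~1.6 MB per array

    cudaFree(0);  // force context creation before the baseline measurement
    size_t freeBefore, freeAfter, total;
    cudaMemGetInfo(&freeBefore, &total);

    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);

    cudaMemGetInfo(&freeAfter, &total);
    printf("explicit allocations: %zu bytes\n", 3 * bytes);
    printf("memory consumed:      %zu bytes\n", freeBefore - freeAfter);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```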

Solution

The CUDA driver maintains numerous device memory allocations, including but not limited to:

  1. Local Memory
    • Size = (user-specified lmem size per thread + driver-specified syscall stack) * MultiprocessorCount * MaxThreadsPerMultiprocessor (see the sketch after this list).
    • Example - 15 SM GK110
      • 15 Multiprocessors
      • 2048 MaxThreadsPerMultiprocessor
      • 2048 bytes per thread (cudaLimitStackSize)
      • 512 bytes per thread for syscall stack
      • Size = 15 * 2048 * (2048 + 512) = 78,643,200 bytes (75 MB)
  2. Printf FIFO
  3. Malloc Heap
  4. Constant Buffers
    • Driver allocates multiple constant buffers per stream. These are used to pass launch configuration and launch parameters, module constants, and constant variables. The PTX manual has additional information on constant buffers.
  5. CUDA Dynamic Parallelism Buffers
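
Item 1's formula can be evaluated at runtime from the device properties. A minimal sketch, assuming the formula above and reusing the 512-byte syscall-stack figure from the GK110 example (the actual driver value is not queryable and may differ per architecture):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    size_t stackSize = 0;  // current cudaLimitStackSize, per thread
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);

    const size_t syscallStack = 512;  // assumption: driver-specified value from the example
    size_t lmem = (stackSize + syscallStack)
                * (size_t)prop.multiProcessorCount
                * (size_t)prop.maxThreadsPerMultiProcessor;

    printf("Estimated local memory reservation: %zu bytes\n", lmem);
    return 0;
}
```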

The driver defers creation of these buffers until necessary. This often means that the memory allocation will be done in one of the API calls to launch a kernel.

Items 1, 2, and 3 can be controlled to some extent through cudaDeviceSetLimit.
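
A minimal sketch of that control, with arbitrary illustrative values rather than recommendations (the printf FIFO and heap limits must be set before the first kernel launch that would create those buffers):

```cuda
#include <cuda_runtime.h>

int main() {
    cudaDeviceSetLimit(cudaLimitStackSize,      1024);     // item 1: per-thread stack
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 << 20);  // item 2: printf FIFO, 1 MB
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 4 << 20);  // item 3: device heap, 4 MB
    // ... kernel launches; the driver sizes its deferred buffers from these limits
    return 0;
}
```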

Item 4 grows linearly with the number of streams allocated and modules loaded. At an architecture-dependent point, the driver will start aliasing stream constant buffers to limit the resource allocations.
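
This growth can be observed directly. A minimal sketch that creates streams one at a time, forces the deferred per-stream buffers into existence with a trivial launch, and samples free memory after each step (the stream count and kernel are arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    cudaFree(0);  // force context creation so it is excluded from the deltas
    size_t freeMem, total;
    cudaMemGetInfo(&freeMem, &total);
    printf("baseline free=%zu\n", freeMem);

    for (int i = 0; i < 8; ++i) {
        cudaStream_t s;
        cudaStreamCreate(&s);
        noop<<<1, 1, 0, s>>>();  // first launch allocates the stream's buffers
        cudaStreamSynchronize(s);
        cudaMemGetInfo(&freeMem, &total);
        printf("streams=%d free=%zu\n", i + 1, freeMem);
        // streams are intentionally not destroyed so the allocations persist
    }
    return 0;
}
```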
