What happens to a GPU multiprocessor's shared memory between kernel block executions?

StackOverflow https://stackoverflow.com/questions/20577906


Problem

Suppose I have a CUDA kernel with a bunch of blocks, and suppose some block is scheduled right after another block on the same streaming multiprocessor (that is, the unit whose warps all share the same area of shared memory). NVIDIA does not, at the moment, specify - either in the API or in per-GPU documents - what happens to the shared memory between executions. In practice, though, which of the following holds regarding the block's shared memory contents?

  • It's in the same state the last scheduled block left it.
  • It's blank.
  • It contains unforeseeable junk.

To narrow down the range of possible behaviors, consider specifically the case in which each block uses the maximum possible amount of shared memory - 48 KB on Kepler GPUs.
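For reference, the limits involved can be queried from the CUDA runtime. A minimal sketch, not part of the original question, using standard cudaDeviceProp fields and assuming device 0:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // device 0, assumed
  printf("SM count: %d\n", prop.multiProcessorCount);
  printf("Shared memory per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);
  // sharedMemPerMultiprocessor requires a reasonably recent toolkit
  printf("Shared memory per SM: %zu bytes\n", (size_t)prop.sharedMemPerMultiprocessor);
  return 0;
}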


Solution

NVIDIA does not publish the behavior of the hardware at this level, so you should consider it undefined (as @datenwolf says). That said, the contents of the shared memory seen by a given block are not going to be random, and there is no reason for the hardware to spend time clearing the memory.

The GPU can run many blocks simultaneously on each SM. The number of blocks that run simultaneously for a given kernel depends on various factors; if shared memory is the limiting factor, each SM will run as many blocks as fit in shared memory. For instance, if there is 48K of shared memory and each block needs 10K, 4 blocks may run at the same time, using 40K. On a device with 8 SMs, my guess is that there would then be 32 (4 * 8) possible, fixed locations for the shared memory of a given block. When a new block is scheduled, it is assigned to one of those locations and sees the shared memory as it was left by the previous block that ran in that location.
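The blocks-per-SM count that this reasoning relies on can be estimated with the occupancy API rather than guessed. A hedged sketch (the 10K figure is the example from the paragraph above; the kernel is the one listed at the end of this answer, and the snippet would go in host code, e.g. main()):

  int blocks_per_sm = 0;
  size_t shared_bytes = 10 * 1024;  // example: 10K of dynamic shared memory per block
  // Standard runtime call (CUDA 6.5 and later); reports how many blocks of this
  // kernel can be resident on one SM, given the block size and shared memory request.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, shared_memory_persistence_test, 1 /* threads per block */, shared_bytes);
  // With 48K of shared memory per SM and 10K per block this should report 4,
  // unless some other resource limits occupancy first.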

The API provides no way for a block to detect in which location it is running. Scheduling of a block is determined dynamically and is probably very hard to predict.
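One partial exception: although the shared-memory slot is not exposed, a block can read the id of the SM it is currently running on through the %smid special register with inline PTX. A small sketch, informational only - %smid is documented as volatile, and it still tells you nothing about which slot within the SM the block was given:

__device__ unsigned int current_smid()
{
  unsigned int id;
  asm("mov.u32 %0, %%smid;" : "=r"(id));  // read the special register %smid
  return id;
}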

If the GPU is used for display, it may be running other kernels (shaders) simultaneously, possibly overwriting the shared memory between blocks in the CUDA kernel in weird and wonderful ways. Even CUDA may be running other kernels behind the scenes.

Edit:

I wrote a small program to test things out (included below). The program takes as an argument the number of integers one block should store in shared memory. It then launches 100,000 blocks, each with one thread. Each block checks whether its shared memory is already initialized. If it is, the block does nothing more. If it is not, the block initializes the memory and increments a global counter. The initialization pattern is an increasing sequence of numbers, to avoid having partially overlapping initialized shared memory buffers appear to be valid.

On a GTX660 (Kepler, CC 3.0, 5 SMs), 48K shared memory configured, CC 3.0 Release build, I got the following results:

C:\rd\projects\cpp\test_cuda\Release>test_cuda.exe 10000
Shared memory initializations: 5

I ran this several times and got the same result each time. This matches my initial guess: 10000 integers take up ~40K, so there is room for only one concurrent block per SM, and this device has 5 SMs.

However, when I reduced the shared memory to 2500 integers (~10K), expecting to get 20 initializations, and ran it several times, I got different high numbers:

Shared memory initializations: 32,822
Shared memory initializations: 99,996
Shared memory initializations: 35,281
Shared memory initializations: 30,748

So, my guess about the fixed locations is completely invalid in this case.

I then tried reducing shared memory to 100 integers (there would be room for 122 blocks in 48K) and consistently got:

Shared memory initializations: 480

So, again, not the expected number and, surprisingly, there are apparently fewer possible variations even though the amount of shared memory used by each block is smaller.

It looks like, if you are determined to shoot yourself in the foot, you can use a large shared memory block to keep things consistent :) Also, this was run on a GPU that was also used for display (Windows 7 with Aero, a GPU-accelerated theme), and rendering did not appear to interfere, since the desktop freezes while the kernel is running.
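For completeness, the "48K shared memory configured" setup mentioned above corresponds, on Kepler, to requesting the 48K shared / 16K L1 split of the on-chip memory. A hedged sketch of how that is typically requested from host code before the launch, using standard runtime calls (either the device-wide or the per-function variant):

  // Prefer the 48K shared memory / 16K L1 split for the whole device...
  cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
  // ...or only for this kernel:
  cudaFuncSetCacheConfig(shared_memory_persistence_test, cudaFuncCachePreferShared);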

Program:

#include "cuda_runtime.h"

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <sstream>
using namespace std;

#define assertCudaSuccess(ans) { _assertCudaSuccess((ans), __FILE__, __LINE__); }
inline void _assertCudaSuccess(cudaError_t code, const char *file, int line)
{
  if (code != cudaSuccess) {
    fprintf(stderr,"CUDA Error: %s %s %d\n", cudaGetErrorString(code), file, line);
    exit(code);
  }
}

__global__ void shared_memory_persistence_test(int n_shared_ints);
__device__ int init_cnt_d(0);

int main(int argc, char* argv[])
{
  cout.imbue(locale(""));
  if (argc < 2) {
    cerr << "Usage: " << argv[0] << " <ints per block in shared memory>" << endl;
    return 1;
  }
  int n_shared_ints;
  stringstream(string(argv[1])) >> n_shared_ints;
  shared_memory_persistence_test<<<dim3(100, 1000), 1, n_shared_ints * sizeof(int)>>>(n_shared_ints);
  assertCudaSuccess(cudaPeekAtLastError());
  assertCudaSuccess(cudaDeviceSynchronize());
  int init_cnt_h;
  assertCudaSuccess(cudaMemcpyFromSymbol(&init_cnt_h, init_cnt_d, sizeof(int), 0, cudaMemcpyDeviceToHost));
  cout << "Shared memory initializations: " << init_cnt_h << endl;
  return 0;
}

__global__ void shared_memory_persistence_test(int n_shared_ints)
{
  extern __shared__ int shared[];

  // Check whether this block's shared memory already holds the pattern 0..n-1.
  for (int i(0); i < n_shared_ints; ++i) {
    if (shared[i] != i) {
      // Not fully initialized: write the pattern and count this block as an initializer.
      for (int j(0); j < n_shared_ints; ++j) {
        shared[j] = j;
      }
      atomicAdd(&init_cnt_d, 1);
      break;
    }
  }
}

Other tips

The state is undefined, which means it could be anything, including any of the three things you guessed. But reading from uninitialized memory could just as well lead to your GPU developing an artificial intelligence.
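The practical consequence is the same in every case: a block must not read shared memory before it has written it. A minimal sketch of the safe pattern, with illustrative names and dynamic shared memory assumed:

__global__ void kernel_using_shared(int n_shared_ints)
{
  extern __shared__ int shared[];
  // Cooperatively initialize the block's shared memory before anyone reads it.
  for (int i = threadIdx.x; i < n_shared_ints; i += blockDim.x) {
    shared[i] = 0;
  }
  __syncthreads();  // make the initialized contents visible to all threads in the block
  // ... rest of the kernel ...
}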
