Is it worthwhile to pass kernel parameters via shared memory?

https://stackoverflow.com/questions/16754885

30-05-2022
|

Pergunta

Suppose that we have an array int * data, each thread will access one element of this array. Since this array will be shared among all threads it will be saved inside the global memory.

Let's create a test kernel:

 __global__ void test(int *data, int a, int b, int c){ ... }

I know for sure that the data array will be in global memory because I allocated memory for this array using cudaMalloc. Now as for the other variables, I've seen some examples that pass an integer without allocating memory, immediately to the kernel function. In my case such variables are a b and c.

If I'm not mistaken, even though we do not call directly cudaMalloc to allocate 4 bytes for each three integers, CUDA will automatically do it for us, so in the end the variables a b and c will be allocated in the global memory.

Now these variables, are only auxiliary, the threads only read them and nothing else.

My question is, wouldn't it be better to transfer these variables to the shared memory?

I imagine that if we had for example 10 blocks with 1024 threads, we would need 10*3 = 30 reads of 4 bytes in order to store the numbers in the shared memory of each block.

Without shared memory and if each thread has to read all these three variables once, the total amount of global memory reads will be 1024*10*3 = 30720 which is very inefficient.

Now here is the problem, I'm somewhat new to CUDA and I'm not sure if it's possible to transfer the memory for variables a b and c to the shared memory of each block without having each thread reading these variables from the global memory and loading them to the shared memory, so in the end the total amount of global memory reads would be 1024*10*3 = 30720 and not 10*3 = 30.

On the following website there is this example:

 __global__ void staticReverse(int *d, int n)
 {
    __shared__ int s[64];
    int t = threadIdx.x;
    int tr = n-t-1;
    s[t] = d[t];
    __syncthreads();
   d[t] = s[tr];
 }

Here each thread loads different data inside the shared variable s. So each thread, according to its index, loads the specified data inside the shared memory.

In my case, I want to load only variables a b and c to the shared memory. These variables are always the same, they don't change, so they don't have anything to do with the threads themselves, they are auxiliary and are being used by each thread to run some algorithm.

How should I approach this problem? Is it possible to achieve this by only doing total_amount_of_blocks*3 global memory reads?

Solução

The GPU runtime already does this optimally without you needing to do anything (and your assumption about how argument passing works in CUDA is incorrect). This is presently what happens:

In compute capability 1.0/1.1/1.2/1.3 devices, kernel arguments are passed by the runtime in shared memory.
In compute capability 2.x/3.x/4.x/5.x/6.x devices, kernel arguments are passed by the runtime in a reserved constant memory bank (which has a dedicated cache with broadcast).

So in your hypothetical kernel

__global__ void test(int *data, int a, int b, int c){ ... }

data, a, b, and c are all passed by value to each block in either shared memory or constant memory (depending on GPU architecture) automatically. There is no advantage in doing what you propose.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow