Question

I have a very basic question which I've failed to understand after going through the documentation. I am facing this issue while executing one of my projects: the output I get is totally corrupted, and I believe the problem is either with memory allocation or with thread synchronization. OK, the question is: does every thread create a separate copy of all the variables and pointers passed to the kernel function? Or does it just create a copy of the variables, while the memory the pointers point to is shared among all threads? e.g.

int main()
{
  const int DC4_SIZE = 3;
  const int DC4_BYTES = DC4_SIZE * sizeof(float);
  float * dDC4_in;
  float * dDC4_out;
  float hDC4_in[DC4_SIZE];
  float hDC4_out[DC4_SIZE];

  gpuErrchk(cudaMalloc((void**) &dDC4_in, DC4_BYTES));
  gpuErrchk(cudaMalloc((void**) &dDC4_out, DC4_BYTES));
  // dc4 initialization function on host which assigns some values to the hDC4_in[] array

  gpuErrchk(cudaMemcpy(dDC4_in, hDC4_in, DC4_BYTES, cudaMemcpyHostToDevice));
  mykernel<<<10,128>>>(VolDepth,dDC4_in);
  cudaMemcpy(hDC4_out, dDC4_out, DC4_BYTES, cudaMemcpyDeviceToHost);
}

__global__ void mykernel(float VolDepth,float * dDC4_in,float * dDC4_out)
{
    for (int index = 0; index < end; index++)  // "end" here is pseudocode for the element count
        dDC4_out[index] = dDC4_in[index] * VolDepth;
}

So I am passing the dDC4_in and dDC4_out pointers to the GPU, with dDC4_in initialized with some values; the kernel computes dDC4_out, which I copy back to the host. Will all 1280 of my threads have separate dDC4_in/dDC4_out copies, or will they all work on the same copies on the GPU, overwriting each other's values?


Solution

Global memory is shared by all threads in a grid. The buffers you pass to your kernel (the ones you allocated with cudaMalloc) are in the global memory space.

Threads do have their own memory (local memory), but in your example dDC4_in and dDC4_out are shared by all of your threads.
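
To make this concrete, here is an annotated sketch of your kernel's parameter semantics (it just labels what is private and what is shared; it's not a fix):

__global__ void mykernel(float VolDepth, float *dDC4_in, float *dDC4_out)
{
    // VolDepth was passed by value, so each thread has its own private copy
    float myDepth = VolDepth;

    // dDC4_in and dDC4_out are per-thread copies of the pointer *values*,
    // but both point at the same buffers in global memory, so every thread
    // reads and writes the same underlying storage
}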

As a general run-down (taken from the CUDA Best Practices documentation):

[Figure: diagram of the CUDA device memory spaces and their scopes, from the CUDA Best Practices Guide]

On the DRAM side: local memory (and registers) are per-thread, shared memory is per-block, and global, constant, and texture memory are per-grid.
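
In code, those scopes look like this (a minimal sketch with made-up names; cCoeffs, scopes_demo, and gBuf are not from your program):

__constant__ float cCoeffs[16];           // constant memory: per-grid, readable by every thread

__global__ void scopes_demo(float *gBuf)  // gBuf points into global memory: per-grid
{
    int t = threadIdx.x;                  // register/local variable: per-thread
    __shared__ float tile[128];           // shared memory: one copy per block

    tile[t] = gBuf[t] * cCoeffs[0];       // assumes blockDim.x <= 128
    __syncthreads();                      // threads in the same block can now see each other's tile[] writes
}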

In addition, global/constant/texture memory can be read and modified on the host, while local and shared memory only exist for the duration of your kernel. That is, if you have some important information in your local or shared memory and your kernel finishes, that memory is reclaimed and your information is lost. This also means that the only way to get data into your kernel from the host is via global/constant/texture memory.
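
For example, the host can fill constant and global memory before a launch (this sketch reuses the hypothetical cCoeffs from above and the buffers from your own code), but there is no analogous call for shared or local memory:

float hCoeffs[16] = {0};  // host-side staging buffer
cudaMemcpyToSymbol(cCoeffs, hCoeffs, sizeof(hCoeffs));           // write constant memory from the host
cudaMemcpy(dDC4_in, hDC4_in, DC4_BYTES, cudaMemcpyHostToDevice); // write global memory from the host
// __shared__ and local memory only exist on the device, and only while the kernel runs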

Anyway, in your case it's a bit hard to recommend how to fix your code, because you don't take threads into account at all. Not only that, but in the code you posted you're only passing 2 arguments to your kernel (which takes 3 parameters), so it's no surprise your results are somewhat lacking. Even if your code were valid, every thread would loop from 0 to end and write to the same locations in memory (the writes would be serialized, but you wouldn't know which write would be the last one to go through). On top of that race condition, every thread would be doing the same computation: each of your 1280 threads would execute that for loop and perform the same steps. You have to decide on a mapping of threads to data elements, divide up the work in your kernel based on that mapping, and perform your computation based on it.

e.g. if you have a 1 thread : 1 element mapping,

__global__ void mykernel(float VolDepth, float *dDC4_in, float *dDC4_out)
{
    // each thread computes a unique global index and handles only that element
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    dDC4_out[index] = dDC4_in[index] * VolDepth;
}

Of course, this also necessitates changing your kernel launch configuration to launch the correct number of threads, and if the number of threads and the number of elements aren't exact multiples, you'll want some added bounds checking in your kernel, as in the sketch below.
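
One common way to do both (note that the extra n parameter is an addition to your original signature, not something your code already has):

__global__ void mykernel(float VolDepth, float *dDC4_in, float *dDC4_out, int n)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)  // extra threads in the last block do nothing
        dDC4_out[index] = dDC4_in[index] * VolDepth;
}

// host side: round the block count up so every element gets a thread
int threadsPerBlock = 128;
int blocks = (DC4_SIZE + threadsPerBlock - 1) / threadsPerBlock;
mykernel<<<blocks, threadsPerBlock>>>(VolDepth, dDC4_in, dDC4_out, DC4_SIZE);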

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow