CUDA: can thread creates separate copy of all the data?

Question

global memory is shared by all threads in a grid. The parameters you pass to your kernel (that you've allocated with cudaMalloc) are in the global memory space.

Threads do have their own memory (local memory), but in your example dDC4_in and dDC4_out are shared by all of your threads.

As a general run-down (taken from the CUDA Best Practices documentation):

enter image description here

On the DRAM side: Local memory (and registers) is per-thread, shared memory is per-block, and global, constant, and texture are per-grid.

In addition, global/constant/texture memory can be read and modified on the host, while local and shared memory are only around for the duration of your kernel. That is, if you have some important information in your local or shared memory and your kernel finishes, that memory is reclaimed and your information lost. Also, this means that the only way to get data into your kernel from the host is via global/constant/texture memory.

Anyways, in your case it's a bit hard to recommend how to fix your code, because you don't take threads into account at all. Not only that, in the code you posted, you're only passing 2 arguments to your kernel (which takes 3 parameters), so it's no surprise your results are somewhat lacking. Even if your code were valid, you would have every thread looping from 0 to end and writing the to the same location in memory (which would be serialized, but you wouldn't know which write would be the last one to go through). In addition to that race condition, you have every thread doing the same computation; each of your 1280 threads will execute that for loop and perform the same steps. You have to decide on a mapping of threads to data elements, divide up the work in your kernel based on your thread to element mapping, and perform your computation based on that.

e.g. if you have a 1 thread : 1 element mapping,

__global__ void mykernel(float VolDepth,float * dDC4_in,float * dDC4_out)
{
    int index = threadIdx.x + blockIdx.x*blockDim.x;
    dDC4_out[index]=dDC4_in[index] * VolDepth;
}

of course this would also necessitate changing your kernel launch configuration to have the correct number of threads, and if the threads and elements aren't exact multiples, you'll want some added bounds checking in your kernel.