Each thread executes the same instruction at the same time (or is idle), so every thread goes into GetSubMatrix.
Yes. Each thread takes a few items, so if there are N threads and 3N items to be copied, each thread will copy 3.
For example, if I were copying a vector, I might do the following:
float* from = ???;
float* to = ???;
int num = ???;
int thread = threadIdx.x + threadIdx.y*blockDim.x ...; // A linear index
int num_threads = blockDim.x * blockDim.y * blockDim.z;
// Start at the linear index (not just threadIdx.x) so that in a 2D or 3D block
// no two threads copy the same elements.
for(int i = thread; i < num; i += num_threads) {
    to[i] = from[i];
}
Every thread is involved in the copy, moving one element per loop iteration. As an aside: if you can arrange for adjacent threads to copy adjacent elements, the memory accesses can be coalesced and you get bonus speed in the copy.
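To make this concrete, here is a minimal, self-contained sketch of the same strided copy. It rests on a few assumptions that are mine rather than part of the question: both buffers live in global memory, a single block of 16x16 threads is launched, and the names strided_copy and copy_roundtrip_ok are made up for illustration. The linear index is written out in full for a 3D block.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Thread t copies elements t, t + num_threads, t + 2*num_threads, ...
// Written for a single-block launch (assumption of this sketch).
__global__ void strided_copy(const float* from, float* to, int num)
{
    // Linear index of this thread within its (possibly 3D) block.
    int thread = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    int num_threads = blockDim.x * blockDim.y * blockDim.z;

    // In each iteration, adjacent threads touch adjacent elements.
    for (int i = thread; i < num; i += num_threads) {
        to[i] = from[i];
    }
}

// Hypothetical helper: copies `num` floats on the device and verifies the result.
bool copy_roundtrip_ok(int num)
{
    std::vector<float> src(num), dst(num, -1.0f);
    for (int i = 0; i < num; ++i) src[i] = static_cast<float>(i);

    float* d_from = nullptr;
    float* d_to = nullptr;
    size_t bytes = static_cast<size_t>(num) * sizeof(float);
    if (cudaMalloc(&d_from, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_to, bytes) != cudaSuccess) { cudaFree(d_from); return false; }

    bool ok = cudaMemcpy(d_from, src.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    dim3 block(16, 16, 1);  // 256 threads in one block
    strided_copy<<<1, block>>>(d_from, d_to, num);
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(dst.data(), d_to, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_from);
    cudaFree(d_to);

    // Every element must have arrived unchanged.
    for (int i = 0; ok && i < num; ++i) ok = (dst[i] == src[i]);
    return ok;
}

int main()
{
    // 256 threads and 3 * 256 items: each thread copies exactly 3 elements.
    bool ok = copy_roundtrip_ok(3 * 256);
    std::printf("%s\n", ok ? "copy ok" : "copy failed");
    return ok ? 0 : 1;
}
```

With 256 threads and 768 items, each thread runs the loop body three times, which is the N threads and 3N items case. The same kernel also handles a count that is not a multiple of the thread count, since the loop condition simply stops the threads that run out of elements.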