Suppose I have a double data[] array like this:
A B C D x x x x
E F G H x x x x
Then data[0] = A, data[1] = B, etc.
This might be the result of allocating a 2x4 array with a pitch of 64 bytes (thus the 4 extra x entries in each row). Let's suppose each element in the above array is a double quantity, so each row of 8 entries occupies 8 * 8 = 64 bytes.
Now suppose I have a kernel, and I am launching an array of 2x4 threads, one per valid element (the x elements are not valid; they are the extra allocations to satisfy the pitch requirement, which is arbitrarily chosen for this example to be 64 bytes).
In this kernel, I could create an x and y index as follows:
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int idy = threadIdx.y + blockDim.y * blockIdx.y;
Each thread can then access its respective element as follows. Since the pitch quantity returned by cudaMallocPitch is in bytes, we need to compute a row offset first, using char pointer arithmetic:
double *rowptr = (double *)((char *)data + idy*pitch);
We can then access an element on that row like this:
rowptr[idx] = 0.0;
If I want a thread to access an element other than the one corresponding to its thread indices, the calculations are similar. For example, to set element G (i.e. element (1,2)) to zero in the above data array, I could do:
double *rowptr = (double *)((char *)data + 1*pitch);
rowptr[2] = 0.0;
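To tie the pieces together, here is a minimal sketch of how the allocation, the per-thread access, and the single-element access might look in one place. The names fill_pitched, zero_element and run_pitched_demo are hypothetical helpers made up for this illustration, and the fill value 10*row + col is arbitrary. Note also that the pitch actually returned by cudaMallocPitch is chosen by the runtime and will typically be larger than the 64 bytes used in the example above, which is why the code never hard-codes it.

```cuda
#include <cuda_runtime.h>

// Each thread writes a value into its own element of the pitched array.
__global__ void fill_pitched(double *data, size_t pitch, int width, int height)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;  // column index
    int idy = threadIdx.y + blockDim.y * blockIdx.y;  // row index
    if (idx >= width || idy >= height) return;        // never touch the padding

    // pitch is in bytes, so locate the row with char pointer arithmetic
    double *rowptr = (double *)((char *)data + idy * pitch);
    rowptr[idx] = 10.0 * idy + idx;                   // arbitrary marker value
}

// A single thread zeroes one chosen element, e.g. G at (row 1, column 2).
__global__ void zero_element(double *data, size_t pitch, int row, int col)
{
    double *rowptr = (double *)((char *)data + row * pitch);
    rowptr[col] = 0.0;
}

// Hypothetical host helper: allocates a pitched width x height array of
// doubles, runs both kernels, and copies the valid elements (without the
// padding) into the tightly packed buffer host[height * width].
// Returns true if every CUDA call succeeded.
bool run_pitched_demo(double *host, int width, int height)
{
    double *data = nullptr;
    size_t pitch = 0;  // filled in by cudaMallocPitch, in bytes
    if (cudaMallocPitch((void **)&data, &pitch,
                        width * sizeof(double), height) != cudaSuccess)
        return false;

    // One thread per valid element; a single block is enough for a tiny
    // array like 2x4 (width * height must not exceed the block size limit).
    dim3 block(width, height);
    fill_pitched<<<1, block>>>(data, pitch, width, height);
    // Same stream, so this runs after fill_pitched has finished.
    zero_element<<<1, 1>>>(data, pitch, 1, 2);

    cudaError_t err = cudaGetLastError();  // catch launch errors
    if (err == cudaSuccess)
        // Source rows are pitch bytes apart, destination rows are packed.
        err = cudaMemcpy2D(host, width * sizeof(double),
                           data, pitch,
                           width * sizeof(double), height,
                           cudaMemcpyDeviceToHost);
    cudaFree(data);
    return err == cudaSuccess;
}
```

Calling run_pitched_demo(host, 4, 2) with a double host[8] buffer should leave host[6] (element G) equal to 0.0, while the other entries keep their marker values, for example host[1] equal to 1.0 and host[7] equal to 13.0. Using cudaMemcpy2D for the copy back means the host never has to know about the padding.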