Question

I have been working on a Game of Life implementation with CUDA. I want to find the array index of each element so I can calculate the neighbours for that element, and so that I can write the new value back to that array. Everything I have found about this works with pointers to rows, and I just can't figure out how that translates to indices exactly. To give a better idea of what I mean, I have the following code (some snippets):

#define THREADSPERBLOCK 1024
lifeState *d_gameStateInitial;
size_t d_gameStateInitial_pitch;
int sizeX = 100;
int sizeY = 100;

// Ceiling division: enough blocks to give every cell its own thread.
int numBlocks = (sizeX * sizeY + THREADSPERBLOCK - 1) / THREADSPERBLOCK;
int numThreadsPerBlock;

if(numBlocks == 1)
{
    numThreadsPerBlock = sizeX * sizeY;
}
else
{
    numThreadsPerBlock = THREADSPERBLOCK;
}

cudaMallocPitch((void **)&d_gameStateInitial, &d_gameStateInitial_pitch, sizeX * sizeof(lifeState), sizeY);

doTheGame<<<numBlocks, numThreadsPerBlock>>>(d_gameStateInitial, d_gameStateInitial_pitch, d_gameStateNew, d_gameStateNew_pitch, sizeX, sizeY);

The "lifestate *" is simply a struct containing an dead/alive enum. Both arrays, the initial and new ones are malloc'd exactly the same way. In the doTheGame kernel i now want to know how to calculate the index, i was thinking about something like this but i think it is wrong:

__global__ void doTheGame(lifeState *initialArray, size_t initialArrayPitch,
                          lifeState *newArray, size_t newArrayPitch,
                          int sizeX, int sizeY)
{
    int initialArrayThreadIndex = (blockIdx.x * initialArrayPitch) + threadIdx.x;
    int newArrayThreadIndex = (blockIdx.x * initialArrayPitch) + threadIdx.x;
}

Everything I have found thus far is basically the same as the cudaMallocPitch example:

  T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;

But I just can't see how that translates to blocks, threads, and x and y exactly.

Thanks in advance.

Solution

Suppose I have a double data[] array like this:

A  B  C  D  x  x  x  x
E  F  G  H  x  x  x  x

Then data[0] = A, data[1] = B etc.

This might be the result of allocating a 2x4 array with a pitch allocation of 64 bytes (thus the 4 extra x entries in each row). Let's suppose each element in the above array is a double quantity.
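
For concreteness, such a layout could come from a host-side allocation like the following sketch (the actual pitch is chosen by the CUDA runtime and is device-dependent; 64 bytes is simply the value assumed in this example):

double *data;
size_t pitch;
// Request 2 rows of 4 doubles each; the runtime pads every row to a
// preferred alignment and reports the row stride, in bytes, via pitch.
cudaMallocPitch((void **)&data, &pitch, 4 * sizeof(double), 2);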

Now suppose I have a kernel, and I am launching an array of 2x4 threads, one per valid element (the x elements are not valid - they are the extra allocations to satisfy the pitch requirement, which is arbitrarily chosen for this example to be 64 bytes).
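
A host-side launch that creates that 2x4 thread array could look like this (a sketch; the kernel name myKernel is invented here, and the single-block grid only works because the example array is tiny):

dim3 block(4, 2);   // 4 threads across (columns), 2 down (rows)
dim3 grid(1, 1);    // a single block covers the whole 2x4 array
myKernel<<<grid, block>>>(data, pitch);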

In this kernel, I could create an x and y index as follows:

int idx = threadIdx.x + blockDim.x * blockIdx.x;
int idy = threadIdx.y + blockDim.y * blockIdx.y;

Each thread can then access its respective element as follows.

Since the pitch quantity returned by cudaMallocPitch is in bytes, we need to compute a row offset first, using char pointer arithmetic:

double *rowptr = (double *)((char *)data + idy*pitch);

We can then access an element on that row like this:

rowptr[idx] = 0.0;
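
One caveat worth making explicit: if the launch rounds the grid up and creates more threads than there are valid elements, each thread should test its indices before touching memory (width and height here are assumed kernel parameters, counted in elements, not bytes):

if (idx < width && idy < height)
{
    double *rowptr = (double *)((char *)data + idy * pitch);
    rowptr[idx] = 0.0;
}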

If I want a thread to access an element other than the one corresponding to its thread indices, the calculations are similar.

For example, to set element G (i.e. element (1,2): row 1, column 2) to zero in the above data array, I could do:

double *rowptr = (double *)((char *)data + 1 * pitch);
rowptr[2] = 0.0;
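
To tie this back to the question's code, below is a minimal sketch of a doTheGame kernel using pitched accesses. It is an illustration, not a definitive implementation: it assumes a 2D launch covering sizeX by sizeY, assumes the lifeState struct exposes its enum through a member named state with ALIVE and DEAD values (names invented for this sketch), and simply skips the border cells instead of wrapping around.

__global__ void doTheGame(lifeState *initialArray, size_t initialArrayPitch,
                          lifeState *newArray, size_t newArrayPitch,
                          int sizeX, int sizeY)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;  // column
    int idy = threadIdx.y + blockDim.y * blockIdx.y;  // row

    // Guard against excess threads, and skip the outer border so the
    // neighbour reads below stay in bounds (a wrap-around/torus board
    // would need modular index arithmetic instead).
    if (idx < 1 || idx >= sizeX - 1 || idy < 1 || idy >= sizeY - 1)
        return;

    // Count live neighbours in the 3x3 window around (idx, idy).
    int liveNeighbours = 0;
    for (int dy = -1; dy <= 1; ++dy)
    {
        lifeState *row = (lifeState *)((char *)initialArray
                                       + (idy + dy) * initialArrayPitch);
        for (int dx = -1; dx <= 1; ++dx)
        {
            if (dx == 0 && dy == 0)
                continue;  // don't count the cell itself
            if (row[idx + dx].state == ALIVE)  // assumes a 'state' member
                ++liveNeighbours;
        }
    }

    lifeState *inRow  = (lifeState *)((char *)initialArray + idy * initialArrayPitch);
    lifeState *outRow = (lifeState *)((char *)newArray + idy * newArrayPitch);

    // Standard Conway rules: a live cell survives with 2 or 3 live
    // neighbours; a dead cell comes alive with exactly 3.
    if (inRow[idx].state == ALIVE)
        outRow[idx].state = (liveNeighbours == 2 || liveNeighbours == 3) ? ALIVE : DEAD;
    else
        outRow[idx].state = (liveNeighbours == 3) ? ALIVE : DEAD;
}

The matching host launch would then use dim3 blocks in both x and y, for example dim3 block(16, 16) with a grid rounded up to cover sizeX by sizeY, rather than the 1D configuration from the question.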