Question

I'm writing a CUDA kernel whose threads perform the following task:

For example, given the array [1, 2, 3, 4], I want the answer [12, 13, 14, 23, 24, 34].

Suppose I have an array of n integers and two indices i and j.

A simple solution in C would be:

k = 0;
for (i = 0; i < n - 1; i++)
    for (j = i + 1; j < n; j++) {
        new_array[k] = array[i] * 10 + array[j];
        k++;
    }

In CUDA I've tried my luck :

for(i = threadIdx.x + 1; i < n-1; i++ )
    new_array[i] = array[threadIdx.x] * 10 + array[i];

But I think this is not a totally correct or optimal way to do this. Can anyone suggest anything better?


Solution

I'm assuming that the code you want to port to CUDA is the following:

#include <stdio.h>
#define N 7

int main(){

    int array[N] = { 1, 2, 3, 4, 5, 6, 7};
    int new_array[(N-1)*N/2] = { 0 };

    int k=0;
    for (int i = 0; i < N; i++)
        for(int j = i+1; j < N; j++)
        { 
            new_array[k] = array[i]*10 + array[j];
            k++;
        }

    for (int i = 0; i < (N-1)*N/2; i++) printf("new_array[%d] = %d\n", i, new_array[i]);

return 0;

}

You may wish to note that you can recast the loop nest as

for (int i = 0; i < N; i++)
    for(int j = i+1; j < N; j++)
        new_array[i*N+(j-(i+1))-(i)*(i+1)/2] = array[i]*10 + array[j];

which avoids the explicit index variable k by computing the target index i*N+(j-(i+1))-(i)*(i+1)/2 directly. This observation is useful because, if you interpret the indices i and j as thread indices in the ported code, you have a mapping between the 2D thread indices and the index needed to access the target array in the __global__ function you have to define.

Accordingly, the ported code is

#include <stdio.h>
#define N 7

__global__ void kernel(int* new_array_d, int* array_d) {

    int i = threadIdx.x;
    int j = threadIdx.y;

    if (j > i) new_array_d[i*N+(j-(i+1))-(i)*(i+1)/2] = array_d[i]*10 + array_d[j];
}

int main(){

    int array[N] = { 1, 2, 3, 4, 5, 6, 7};
    int new_array[(N-1)*N/2] = { 0 };

    int* array_d;       cudaMalloc((void**)&array_d,N*sizeof(int));
    int* new_array_d;   cudaMalloc((void**)&new_array_d,(N-1)*N/2*sizeof(int));

    cudaMemcpy(array_d,array,N*sizeof(int),cudaMemcpyHostToDevice);

    dim3 grid(1,1);
    dim3 block(N,N);
    kernel<<<grid,block>>>(new_array_d,array_d);

    cudaMemcpy(new_array,new_array_d,(N-1)*N/2*sizeof(int),cudaMemcpyDeviceToHost);

    for (int i = 0; i < (N-1)*N/2; i++) printf("new_array[%d] = %d\n", i, new_array[i]);

    return 0;
}

Please add your own CUDA error checking, in the sense of "What is the canonical way to check for errors using the CUDA runtime API?". Also, you may wish to extend the above CUDA code to the case of block grids of non-unitary sizes.
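A minimal sketch of that extension (BLOCKSIZE is my choice here, not from the original): use a 2D grid of 2D blocks, pass n as a kernel argument, and guard against the extra threads that a non-divisible grid launches:

```cuda
#define BLOCKSIZE 16

__global__ void kernel(int* new_array_d, int* array_d, int n) {

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // Bounds check: threads outside the n x n index space, or on/below
    // the diagonal, do nothing.
    if (i < n && j < n && j > i)
        new_array_d[i * n + (j - (i + 1)) - i * (i + 1) / 2] = array_d[i] * 10 + array_d[j];
}

// Launch with enough blocks to cover the n x n index space:
//     dim3 block(BLOCKSIZE, BLOCKSIZE);
//     dim3 grid((n + BLOCKSIZE - 1) / BLOCKSIZE, (n + BLOCKSIZE - 1) / BLOCKSIZE);
//     kernel<<<grid, block>>>(new_array_d, array_d, n);
```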

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow