Question

I need to permute the elements of an array (row of a matrix) with parallel programming in CUDA.

My attempt is the following

__global__ void CudaProcessingKernel(int *dataB) 
{

    int bx = blockIdx.x;  
    int tx = threadIdx.x;  
    int tid = bx * XTHREADS + tx;  

    if(tid< 16)     // matrix 4x4
    {

        if(tid==4)  dataB[tid]=dataB[5];
        if(tid==5)  dataB[tid]=dataB[6];
        if(tid==6)  dataB[tid]=dataB[7];
        if(tid==7)  dataB[tid]=dataB[4];
    }

    __syncthreads();
}           

The above approach doesn't work.

What I want:

input B[1][]  =  **EC 6E 4C 90** => output **6E 4C 90 EC**

My output is:

**6E 90 4C 6E**

B[4] and B[7] end up with the same value. I thought that, with parallel processing, I should be able to avoid temporary storage such as int TEMP = B[4]; B[7] = TEMP.

What am I doing wrong? What kind of memory do I have to use?


Solution

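I would suggest using atomic operations, for example atomicCAS, to address the race condition whose effect you are observing. Be aware that the atomic only protects the update of data[tid] itself: the neighbouring element is still read with an ordinary load, so a thread in another warp or block may already have overwritten it, and the result is not guaranteed for every grid size. As an alternative, you can write the result to a new, temporary array (as you mentioned in your post) and avoid atomic operations altogether.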

Here is the code using CUDA atomics.

#include <stdio.h>
#include <stdlib.h>

#define N 10

// Each thread replaces data[tid] with its cyclic right-hand neighbour.
__global__ void swap_kernel(int* data) {

    int tid = blockIdx.x*blockDim.x + threadIdx.x;

    // The compare value and the new value are loaded before the atomic executes.
    if (tid < N) { atomicCAS(&data[tid], data[tid], data[(tid+1) % N]); }

}

int main() {

    int* h_vec = (int*) malloc(N*sizeof(int));
    int* d_vec; cudaMalloc((void**)&d_vec,N*sizeof(int));

    for (int i=0; i<N; i++) h_vec[i] = i;

    cudaMemcpy(d_vec,h_vec,N*sizeof(int),cudaMemcpyHostToDevice);

    swap_kernel<<<2,8>>>(d_vec);   // 16 threads are launched, only the first N do work

    cudaMemcpy(h_vec,d_vec,N*sizeof(int),cudaMemcpyDeviceToHost);

    for (int i=0; i<N; i++) printf("%i %i\n",i,h_vec[i]);

    getchar();

    cudaFree(d_vec);
    free(h_vec);

    return 0;
}
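
The temporary-array alternative mentioned above can be sketched as follows. This is only a minimal illustration, not the original poster's code: the names rotate_left_kernel and rotate_left are hypothetical, and error checking is left out. Because the kernel only reads from the input buffer and only writes to the output buffer, no thread can see a value that another thread has already overwritten, whatever the grid size.

```cuda
#include <stdio.h>

#define N 10

// Out-of-place cyclic left rotation: reads only from `in`, writes only to `out`.
__global__ void rotate_left_kernel(const int* in, int* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = in[(tid + 1) % n];
}

// Hypothetical host helper: rotates the n elements of h_vec left by one.
void rotate_left(int* h_vec, int n) {
    int *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(int));
    cudaMalloc((void**)&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_vec, n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 8;                               // arbitrary block size
    int blocks  = (n + threads - 1) / threads;     // enough blocks to cover n
    rotate_left_kernel<<<blocks, threads>>>(d_in, d_out, n);

    // Copy the rotated result back over the host array.
    cudaMemcpy(h_vec, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}

int main() {
    int h_vec[N];
    for (int i = 0; i < N; i++) h_vec[i] = i;

    rotate_left(h_vec, N);

    for (int i = 0; i < N; i++) printf("%i %i\n", i, h_vec[i]);
    return 0;
}
```

With the input 0..9, element i should end up holding the old value of element (i+1) % N. The cost is one extra device buffer of the same size as the input.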

OTHER TIPS

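You are modifying data in global memory while other threads are still reading it, so the output is wrong. Instead, have every thread first read its value into a local variable (a register), synchronize, and only then write the value to its new position in the array. Note that __syncthreads() only synchronizes the threads of one block, so this works when all threads touching the row belong to the same block. Something like: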

[..]
int local = dataB[indexToSwap+tid];

__syncthreads();

dataB[indexSwap+tid] = local;
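
Applied to the 4x4 matrix from the question, the read, synchronize, write pattern could look like the sketch below. It assumes a row-major int matrix and a launch with a single block; the names rotate_row_kernel and rotate_row are hypothetical, and error checking is omitted.

```cuda
#include <stdio.h>

#define ROWS 4
#define COLS 4

// Rotates one row of a row-major ROWS x COLS matrix left by one, in place.
// Assumes a single block with at least COLS threads.
__global__ void rotate_row_kernel(int* data, int row) {
    int tx = threadIdx.x;
    int local = 0;

    // Step 1: every thread copies its source element into a register.
    if (tx < COLS) local = data[row * COLS + (tx + 1) % COLS];

    // Step 2: wait until all threads of the block have finished reading.
    // The barrier sits outside the conditionals so every thread reaches it.
    __syncthreads();

    // Step 3: only now overwrite the row.
    if (tx < COLS) data[row * COLS + tx] = local;
}

// Hypothetical host helper: rotates row `row` of the host matrix h.
void rotate_row(int* h, int row) {
    int* d;
    size_t bytes = ROWS * COLS * sizeof(int);
    cudaMalloc((void**)&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    rotate_row_kernel<<<1, COLS>>>(d, row);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}

int main() {
    int h[ROWS * COLS] = {0};
    // Row 1 from the question: EC 6E 4C 90
    h[4] = 0xEC; h[5] = 0x6E; h[6] = 0x4C; h[7] = 0x90;

    rotate_row(h, 1);

    for (int j = 0; j < COLS; j++) printf("%02X ", (unsigned)h[1 * COLS + j]);
    printf("\n");   // prints: 6E 4C 90 EC
    return 0;
}
```

No extra array in global memory is needed here; the temporary copy lives in each thread's register. If the row were spread over several blocks, __syncthreads() would no longer be sufficient and a separate output buffer would be the safer choice.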
Licensed under: CC-BY-SA with attribution