CUDA interthread communication

https://stackoverflow.com/questions/20119103

03-08-2022
|

Question

I am new in CUDA programing, and I have bit of a problem. I was trying to write a program, which needs interthread communication; I tried it on every possible ways I have found, and it is still not working. What do you think, what am I missing?

The code snippet below is my entire program. It starts 2 threads in the very same block. They get an input, and an output array, and another global variable to communicate through. The value 0 means that the variable is empty, and so writable. Basicly the first one reads an element from the input, passes the value to the second one, which writes it into the output array. Later it supposed to be a pipeline, with more threads between A and B.

#include <cuda.h>
#include <cuda_runtime.h>

#include <stdio.h>

#define N 1

__global__ void link(int *in, int *out, int *pipe){ 
    int id = threadIdx.y*blockDim.x + threadIdx.x;  //compute index     

    if(id == 0){        //writer thread

        for(int index = 0;index<N;){                
            if(pipe[0]==0){             
                atomicExch(pipe, in[index++]);              
            }           
        }
    }
    else if(id == 1){   // reader thread    

        for(int index=0;index<N;)   {
            if(pipe[0]!=0){
                out[index++] = atomicExch(pipe, 0); //read and make it empty    
            }           
        }           
    }
}

int main(){
    int input[] = {8,7};
    int *dev_input;
    int *dev_output;
    int *dev_pipe;
    int *output = (int*) malloc (N*sizeof(int));

    cudaMalloc((void**) &dev_input, N*sizeof(int));
    cudaMalloc((void**) &dev_output, N*sizeof(int));
    cudaMalloc((void**) &dev_pipe, 1*sizeof(int));
    cudaMemset(dev_pipe, 0, 1);
    cudaMemcpy(dev_input, &input[0], N*sizeof(int), cudaMemcpyHostToDevice);

    link<<<1, 2>>>(dev_input, dev_output, dev_pipe);

    cudaMemcpy(output, dev_output, N*sizeof(int), cudaMemcpyDeviceToHost);

    printf("[%d", output[0]);
    for(int i = 1;i<N;i++)
        printf(", %d", output[i]);
    printf("]\n");
    int d = 0;
    scanf("\n", &d);

}

If the reader see, that the pipe is 0 (empty), it puts the first element on it, but the writer can not see any changes, and the program goes into a deadlock. I tried to add __threadfence and __syncthreads, but it did not help. I also tried volatile shared memory, but it didn't work either. Please help me, if you can, because I have no idea, what is wrong with it.

Solution

Beware, CUDA threads are much different from POSIX threads. They work according to the Single Instruction Multiple Threads paradigm (SIMT, see this interesting discussion): at each clock tick, every threads (in a single 'wrap') runs the same (low-level) instruction.

In you code, the writer thread will run while the reader thread will perform NOPs and then, the second thread will run while the first one performs NOPs but they will never run simultaneously so you would not benefit from the massively parallel structure of GPUs.

Anyway, to answer your question, your for loops

for(int index=0;index<N;)

do not increment index; so they are infinite loops. Replace with

for(int index=0;index<N;index++)

And share memory, that is common to all threads within the same block, would be much faster than global memory for inter-thread communication.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow