You are launching your GPU kernel with multiple blocks. While the threads in one block are still computing their sums in the first `for` loop, other blocks might already be doing the parallel reduction in the second `for` loop. These two `for` loops have a data dependency. Because the scheduling of blocks onto the Streaming Multiprocessors happens behind the scenes and may vary from one run to another, you get different results each time. A `__syncthreads();` between the `for` loops synchronizes the threads inside a block, but there is no mechanism or instruction for synchronization across blocks, short of returning to the host and issuing another kernel.
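To make that concrete, here is a minimal sketch of the two-pass pattern from the host side. The kernel names `blockSums` and `finalSum` are illustrative, not from your code: the first launch writes one partial sum per block, and the boundary between the two launches acts as the grid-wide barrier that no instruction provides inside a kernel.

```
// Hypothetical two-pass host code; names are illustrative.
int    threads = 256;
int    blocks  = (n + threads - 1) / threads;
size_t smem    = threads * sizeof(float);

// Pass 1: each block reduces its own chunk into d_partials[blockIdx.x].
blockSums<<<blocks, threads, smem>>>(d_in, d_partials, n);

// Launches on the same stream run in order, so blockSums has fully
// finished before finalSum starts: this is the grid-wide sync point.
// Pass 2: a single block reduces the partials (assumes blocks <= threads,
// otherwise the kernel needs a loop over the partials).
finalSum<<<1, threads, smem>>>(d_partials, d_out, blocks);

cudaDeviceSynchronize();  // wait before reading d_out on the host
```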
In your case, even if you simply separate the `for` loops, the results you get can still be wrong, because your reduction itself happens across multiple blocks: each reduction step reads values written by other blocks, and again the scheduling of blocks is not deterministic.