Question

I've recently been testing the reduction algorithm in CUDA (the one you can find, for example, at http://www.cuvilib.com/Reduction.pdf, page 16), but I ran into trouble at the final step when not using atomic operations. Basically, each block computes its partial sum into shared memory, and I then write it back to the global array x (tdx is threadIdx.x, and i is the global thread index).

if (i == 0) {
    *sum = 0.; // initialize the final sum to 0
}
__syncthreads();
if (tdx == 0) {
    x[blockIdx.x] = s_x[tdx]; // write this block's partial sum to global memory
}
__syncthreads();

Then I want to sum the first gridDim.x elements of x (one per block). With atomics it works fine (same result as the CPU), but when I use the commented line below instead, it does not work and often yields "nan":

if (i == 0) {
    for (int k = 0; k < gridDim.x; k++) {
        atomicAdd(sum, x[k]);  // works
        //sum[0] += x[k];      // or *sum += x[k]; does not work, often yields nan
    }
}

In practice I now use atomicAdd directly to accumulate the shared sums, but I would like to understand why the plain version fails. An atomic add seems pointless when the operation is restricted to a single thread, and the simple sum should work just as well!


Solution

__syncthreads() only synchronizes threads within the same block, not across blocks, and CUDA has no safe mechanism for synchronizing across blocks within a single kernel launch.

The incorrect result is due to a synchronization problem. The operands x[k] are the outcomes of computations from different blocks: x[0] is the result from block 0, x[1] is the result from block 1, and so on. Thread 0 can start adding them up before some blocks have actually finished writing their results.

You should put the second code snippet in a separate kernel. Consecutive kernel launches on the same stream are implicitly ordered, so every block of the first kernel is guaranteed to have finished before the second kernel starts, and the line sum[0] += x[k]; then works.
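As a minimal sketch of that two-kernel approach (the kernel name finalSum and the host-side names are mine, not from the question):

// Kernel 2: runs after the reduction kernel has written one
// partial sum per block into x[0..numBlocks-1]. The launch
// boundary guarantees those writes are complete, so a plain
// += by a single thread is safe here without atomics.
__global__ void finalSum(const float *x, float *sum, int numBlocks)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float s = 0.f;
        for (int k = 0; k < numBlocks; k++)
            s += x[k];
        *sum = s;
    }
}

// Host side (hypothetical launch parameters):
// reduceKernel<<<numBlocks, threadsPerBlock, smemBytes>>>(d_in, d_x);
// finalSum<<<1, 1>>>(d_x, d_sum, numBlocks);

Summing gridDim.x values with one thread is serial, of course; for large block counts you would instead run the same block-level reduction again over x, but the single-thread loop matches the code in the question.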

OTHER TIPS

As has been pointed out, your problem is due to the missing synchronisation after the first pass, since you cannot synchronise between blocks. There is a good walkthrough of reduction in the sample codes provided with the CUDA Toolkit.

Having said that, I would strongly recommend not writing reduction kernels (or other primitives such as scan) yourself where such primitives already exist in library code. It is much better to invest your effort elsewhere and reuse existing optimised code where it is available. This doesn't apply if you're doing it to learn, of course!

I recommend you take a look at Thrust and CUB.
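For example, the whole reduction collapses to a single library call with Thrust (a minimal sketch; the input size and values here are made up):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main()
{
    // 2^20 elements, all set to 1.0f, living in device memory
    thrust::device_vector<float> d_x(1 << 20, 1.0f);

    // thrust::reduce performs the sum on the GPU and handles the
    // cross-block combination (the part the question struggled with)
    // internally.
    float sum = thrust::reduce(d_x.begin(), d_x.end(), 0.0f);

    // sum == 1048576.0f
    return 0;
}

CUB offers the same functionality at a lower level (e.g. cub::DeviceReduce::Sum) with finer control over temporary storage and streams.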

Licensed under: CC-BY-SA with attribution