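__syncthreads() only synchronizes threads within the same block, not across different blocks, and an ordinary kernel launch gives you no safe barrier across blocks (a grid-wide barrier exists only through cooperative groups, which require a cooperative launch).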
The incorrect result is due to a synchronization problem. The operands x[k] are the outcomes of the computations from different blocks: x[0] is the result from block 0, x[1] is the result from block 1, etc. Thread 0 could start adding them up before some blocks have finished their computations.
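You should move the second code snippet into a separate kernel that is launched after the first one. Kernels issued to the same stream run in order, so the second kernel starts only after every block of the first has finished, and the line sum[0] += x[k]; then reads completed values.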
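As a rough sketch of the two-kernel approach, something like the following could work. The original kernels are not shown here, so the kernel names partial_sums and final_sum, the host wrapper sum_on_gpu, and the block size THREADS are all hypothetical; only x, sum, and the line sum[0] += x[k]; come from the question. Error checking is left out for brevity, and a non-empty input is assumed.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 256;  // assumed block size; must be a power of two

// Kernel 1 (hypothetical name): each block reduces its slice of `in`
// and writes one partial result to x[blockIdx.x].
__global__ void partial_sums(const float* in, float* x, int n) {
    __shared__ float cache[THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // barrier for the threads of THIS block only

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) x[blockIdx.x] = cache[0];
}

// Kernel 2 (hypothetical name): runs after kernel 1 on the same stream,
// so every x[k] has been written by the time it starts.
__global__ void final_sum(const float* x, float* sum, int num_blocks) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        sum[0] = 0.0f;
        for (int k = 0; k < num_blocks; ++k)
            sum[0] += x[k];
    }
}

// Host wrapper (hypothetical): assumes h_in is non-empty.
float sum_on_gpu(const std::vector<float>& h_in) {
    int n = static_cast<int>(h_in.size());
    int blocks = (n + THREADS - 1) / THREADS;

    float *d_in = nullptr, *d_x = nullptr, *d_sum = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_x, blocks * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Both launches go to the default stream, so they execute in order.
    partial_sums<<<blocks, THREADS>>>(d_in, d_x, n);
    final_sum<<<1, 1>>>(d_x, d_sum, blocks);

    float h_sum = 0.0f;
    // This blocking copy also waits for both kernels to finish.
    cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_x);
    cudaFree(d_sum);
    return h_sum;
}
```

The second kernel uses a single thread because the number of partial sums is usually small; for many blocks, the first kernel could be reused on x instead.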