You are launching your GPU kernel with multiple blocks. While the threads in one block are still computing their sums in the first `for` loop, other blocks might already be doing the parallel reduction in the second `for` loop. These two `for` loops have a data dependency. Because the scheduling of blocks onto the Streaming Multiprocessors happens behind the scenes and may vary from one run to another, you get different results each time. A `__syncthreads();` between the `for` loops synchronizes the threads inside a block, but there is no mechanism or instruction for synchronization across blocks, short of returning to the host and issuing another kernel.
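To make that concrete, here is a minimal sketch of the two-pass pattern from the host side. The kernel names `blockSums` and `finalSum` are illustrative, not from your code: the first launch writes one partial sum per block, and the boundary between the two launches acts as the grid-wide barrier that no instruction provides inside a kernel.

```
// Hypothetical two-pass host code; names are illustrative.
int    threads = 256;
int    blocks  = (n + threads - 1) / threads;
size_t smem    = threads * sizeof(float);

// Pass 1: each block reduces its own chunk into d_partials[blockIdx.x].
blockSums<<<blocks, threads, smem>>>(d_in, d_partials, n);

// Launches on the same stream run in order, so blockSums has fully
// finished before finalSum starts: this is the grid-wide sync point.
// Pass 2: a single block reduces the partials (assumes blocks <= threads,
// otherwise the kernel needs a loop over the partials).
finalSum<<<1, threads, smem>>>(d_partials, d_out, blocks);

cudaDeviceSynchronize();  // wait before reading d_out on the host
```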
In your case, even if you simply separate the `for` loops, the results you get can still be wrong, because your reduction itself happens across multiple blocks: each reduction step reads values written by other blocks, and again the scheduling of blocks is not deterministic.