Question

I am new to CUDA. I am working on basic parallel algorithms, such as reduction, in order to understand how thread execution works. I have the following code:

__global__ void
Reduction2_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();

    for ( int activeThreads = blockDim.x>>1;
              activeThreads > 32;
              activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
        __syncthreads();
    }
    if ( threadIdx.x < 32 ) {
        volatile int *wsSum = sPartials;
        if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; // why do we need this statement, any example please?
        wsSum[tid] += wsSum[tid + 16];  // how are these statements executed in parallel within a warp?
        wsSum[tid] += wsSum[tid + 8];
        wsSum[tid] += wsSum[tid + 4];
        wsSum[tid] += wsSum[tid + 2];
        wsSum[tid] += wsSum[tid + 1];
        if ( tid == 0 ) {
            volatile int *wsSum = sPartials; // why is this statement needed?
            out[blockIdx.x] = wsSum[0];
        }
    }
}

Unfortunately, it is not clear to me how the code works from the if ( threadIdx.x < 32 ) condition onward. Can somebody give an intuitive example with thread IDs and show how the statements are executed? I think it is important to understand these concepts, so any help would be appreciated!

Was it helpful?

Solution

Let's look at the code in blocks, and answer your questions along the way:

int sum = 0;
const int tid = threadIdx.x;
for ( size_t i = blockIdx.x*blockDim.x + tid;
      i < N;
      i += blockDim.x*gridDim.x ) {
    sum += in[i];
}

The above code travels through a data set of size N. An assumption we can make for understanding purposes is that N > blockDim.x*gridDim.x, this last term simply being the total number of threads in the grid. Since N is larger than the total number of threads, each thread sums multiple elements from the data set. From the standpoint of a given thread, it sums elements that are spaced apart by the total number of threads in the grid (blockDim.x*gridDim.x). Each thread stores its running sum in a local (presumably register) variable named sum.
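
For concreteness, here is a small host-side sketch (not from the original code; the launch parameters N = 1000, blockDim.x = 128, gridDim.x = 2 are just assumed for illustration) that replays the index arithmetic for a single thread:

#include <cstdio>

int main()
{
    const size_t N = 1000;                      // assumed data-set size
    const int blockDimX = 128, gridDimX = 2;    // assumed launch configuration (stride = 256)
    const int blockIdxX = 1, tid = 3;           // the one thread we trace

    for ( size_t i = (size_t)blockIdxX*blockDimX + tid;
          i < N;
          i += (size_t)blockDimX*gridDimX ) {
        printf("block 1, tid 3 adds in[%zu]\n", i);   // prints 131, 387, 643, 899
    }
    return 0;
}

That thread's private sum would end up holding in[131] + in[387] + in[643] + in[899].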

sPartials[tid] = sum;
__syncthreads();

As each thread finishes (i.e., as its for-loop index reaches or exceeds N), it stores its intermediate sum in shared memory and then waits for all other threads in the block to finish.

for ( int activeThreads = blockDim.x>>1;
          activeThreads > 32;
          activeThreads >>= 1 ) {
    if ( tid < activeThreads ) {
        sPartials[tid] += sPartials[tid+activeThreads];
    }
    __syncthreads();
}

So far we haven't talked about the dimension of the block - it hasn't mattered. Let's assume each block has some integer multiple of 32 threads. The next step is to start gathering the various intermediate sums stored in shared memory into smaller and smaller groups of variables. The above code starts out by selecting half of the threads in the threadblock (blockDim.x>>1) and uses each of those threads to combine two of the partial sums in shared memory. So if our threadblock started out at 128 threads, we just used 64 of those threads to reduce 128 partial sums to 64 partial sums. This process continues repeatedly in the for-loop, each time cutting the number of threads in half and combining partial sums, two at a time per thread, for as long as activeThreads > 32. So when activeThreads is 64, those 64 threads combine 128 partial sums into 64 partial sums; but when activeThreads becomes 32, the for-loop terminates, without combining 64 partial sums into 32. So at the completion of this block of code, we have taken the (arbitrary multiple of 32 threads) threadblock and reduced however many partial sums we started out with down to 64. This process of combining, say, 256 partial sums to 128 partial sums to 64 partial sums must wait at each iteration for all threads (in multiple warps) to complete their work, so a __syncthreads(); statement is executed on each pass of the for-loop.
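
As a quick sanity check, this little host-side loop (purely illustrative, assuming a block size of 256) prints what the for-loop above does on each pass:

#include <cstdio>

int main()
{
    const int blockDimX = 256;   // assumed block size for the trace
    for ( int activeThreads = blockDimX>>1; activeThreads > 32; activeThreads >>= 1 )
        printf("%3d threads fold %3d partial sums into %3d\n",
               activeThreads, 2*activeThreads, activeThreads);
    // prints:
    //   128 threads fold 256 partial sums into 128
    //    64 threads fold 128 partial sums into  64
    // then stops, because the next value (32) fails the activeThreads > 32 test,
    // leaving 64 partial sums for the warp-synchronous code that follows.
    return 0;
}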

Keep in mind, at this point, we have reduced our threadblock to 64 partial sums.

if ( threadIdx.x < 32 ) {

For the remainder of the kernel after this point, we will only be using the first 32 threads (i.e., the first warp). All other threads will remain idle. Note that there are no __syncthreads(); calls after this point either, as that would violate the rule for using it (all threads in the block must participate in a __syncthreads();).

    volatile int *wsSum = sPartials;

We are now creating a volatile pointer to shared memory. This tells the compiler that it should not perform various optimizations, such as keeping a particular value in a register. Why didn't we need this before? Because __syncthreads(); also carries a memory-fencing function: a __syncthreads(); call, in addition to causing all threads to wait at the barrier for each other, also forces all thread updates back into shared or global memory. We can no longer depend on this feature, because from here on out we will not be using __syncthreads(); we have restricted ourselves, for the remainder of the kernel, to a single warp.
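
To see why that matters, here is a sketch of the transformation an optimizing compiler would be allowed to make on the lines below if wsSum were not volatile (an illustration of the hazard, not the output of any particular compiler):

int r = sPartials[tid];     // running total kept in a register...
r += sPartials[tid + 16];
r += sPartials[tid + 8];    // ...so other threads in the warp that read
r += sPartials[tid + 4];    // sPartials[tid] would still see the value stored
r += sPartials[tid + 2];    // before these lines, not the intermediate totals,
r += sPartials[tid + 1];    // which breaks the reduction
sPartials[tid] = r;         // only one store, at the very end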

    if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; // why do we need this

The previous reduction block left us with 64 partial sums, but we have at this point restricted ourselves to 32 threads. So we must do one more combination to gather the 64 partial sums into 32 partial sums before we can proceed with the remainder of the reduction. The blockDim.x > 32 guard is there because a block of exactly 32 threads produces only 32 partial sums in the first place; in that case there is no second set of 32 values to fold in, and wsSum[tid + 32] would read beyond the partial sums the block actually wrote.
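
A worked example with thread IDs, assuming blockDim.x == 128 (so the earlier loop left 64 partial sums in sPartials[0..63]):

// thread 0  does  wsSum[0]  += wsSum[32];
// thread 1  does  wsSum[1]  += wsSum[33];
// ...
// thread 31 does  wsSum[31] += wsSum[63];
//
// after which the 64 partial sums of interest live in wsSum[0..31]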

    wsSum[tid] += wsSum[tid + 16];  // how are these statements executed in parallel within a warp?

Now we are finally getting into some warp-synchronous programming. This line of code depends on the fact that 32 threads are executing in lockstep. To understand why (and how it works at all) it will be convenient to break this down into the sequence of operations needed to complete this line of code. It looks something like:

    read the partial sum of my thread into a register
    read the partial sum of the thread that is 16 higher than my thread, into a register
    add the two partial sums
    store the result back into the partial sum corresponding to my thread

All 32 threads will follow the above sequence in lock-step. All 32 threads begin by reading wsSum[tid] into a (thread-local) register: thread 0 reads wsSum[0], thread 1 reads wsSum[1], etc. After that, each thread reads another partial sum into a different register: thread 0 reads wsSum[16], thread 1 reads wsSum[17], etc. It's true that we don't care about the wsSum[32] (and higher) values; we've already collapsed those into the first 32 wsSum[] values. However, as we'll see, only the first 16 threads (at this step) contribute to the final result, so those first 16 threads combine the 32 partial sums of interest into 16. The next 16 threads act as well, but they are just doing garbage work -- it will be ignored.
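
Written out explicitly, a sketch of what each of the 32 threads does for this one line (the register names a and b are just illustrative):

int a = wsSum[tid];        // thread 0 loads wsSum[0],  thread 1 loads wsSum[1],  ...
int b = wsSum[tid + 16];   // thread 0 loads wsSum[16], thread 1 loads wsSum[17], ...
wsSum[tid] = a + b;        // all 32 stores happen in the same lock-step instruction,
                           // so every load above saw the pre-store values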

The above step combined 32 partial sums into the first 16 locations in wsSum[]. The next line of code:

    wsSum[tid] += wsSum[tid + 8];

repeats this process with a granularity of 8. Again, all 32 threads are active, and the micro-sequence is something like this:

    read the partial sum of my thread into a register
    read the partial sum of the thread that is 8 higher than my thread, into a register
    add the two partial sums
    store the result back into the partial sum corresponding to my thread

So the first 8 threads combine the first 16 partial sums (wsSum[0..15]) into 8 partial sums (contained in wsSum[0..7]). The next 8 threads are also combining wsSum[8..23] into wsSum[8..15], but the writes to locations 8..15 occur after those values were read by threads 0..7, so the valid data is not corrupted. It's just extra junk work going on. Likewise for the other groups of 8 threads within the warp. So at this point we have combined the partial sums of interest into 8 locations.

    wsSum[tid] += wsSum[tid + 4];  //this combines partial sums of interest into 4 locations
    wsSum[tid] += wsSum[tid + 2];  //this combines partial sums of interest into 2 locations
    wsSum[tid] += wsSum[tid + 1];  //this combines partial sums of interest into 1 location

And these lines of code follow a similar pattern to the previous two: they partition the warp into 8 groups of 4 threads (only the first 4-thread group contributes to the final result), then into 16 groups of 2 threads (only the first 2-thread group contributes), and finally into 32 groups of 1 thread each, each thread generating a partial sum, with only the first partial sum being of interest.
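
To see the whole warp-synchronous tail in action, here is a small host-side simulation (illustrative only, for an assumed 128-thread block that left 64 partial sums behind); the per-offset snapshot models the lock-step rule that all loads of an instruction complete before any of its stores:

#include <cstdio>

int main()
{
    int wsSum[64];                                   // stand-in for the 64 partial sums
    for ( int i = 0; i < 64; ++i ) wsSum[i] = i + 1; // arbitrary values; true total is 2080

    for ( int offset = 32; offset >= 1; offset >>= 1 ) {
        int snapshot[64];
        for ( int i = 0; i < 64; ++i ) snapshot[i] = wsSum[i];  // all reads first...
        for ( int tid = 0; tid < 32; ++tid )                    // ...then all 32 writes
            wsSum[tid] = snapshot[tid] + snapshot[tid + offset];
        printf("after offset %2d: wsSum[0] = %4d\n", offset, wsSum[0]);
    }
    // last line printed: "after offset  1: wsSum[0] = 2080", the block's total
    return 0;
}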

    if ( tid == 0 ) {
        volatile int *wsSum = sPartials;// why this statement is needed?
        out[blockIdx.x] = wsSum[0];
    }

At last, in the previous step, we had reduced all partial sums down to a single value. It's now time to write that single value out to global memory. Are we done with the reduction? Perhaps, but probably not. If the above kernel were launched with only 1 threadblock, then we would be done -- our final "partial" sum is in fact the sum of all elements in the data set. But if we launched multiple blocks, then the final result from each block is still a "partial" sum, and the results from all blocks must be added together (somehow).
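
One common way to finish the job is a second pass: launch the same kernel again with a single block, feeding it the per-block results. A hedged host-side sketch (the names d_in, d_partial, d_out, numBlocks, numThreads are illustrative, and the device allocations are assumed to be made by the caller):

void reduce_two_pass( int *d_out, int *d_partial, const int *d_in,
                      size_t N, int numBlocks, int numThreads )
{
    size_t smem = numThreads * sizeof(int);   // one int of shared memory per thread
    Reduction2_kernel<<<numBlocks, numThreads, smem>>>( d_partial, d_in, N );     // per-block sums
    Reduction2_kernel<<<1, numThreads, smem>>>( d_out, d_partial, numBlocks );    // sum of those sums
    // after the second launch, d_out[0] holds the sum of all N input elements
}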

And to answer your final question:

I don't know why that statement is needed.

My guess is that it was left around from a previous iteration of the reduction kernel, and the programmer forgot to delete it, or didn't notice that it wasn't needed. Perhaps someone else will know the answer to this one.

Finally, the CUDA reduction sample provides very good reference code for study, and the accompanying PDF document does a good job of describing the optimizations that can be made along the way.

Other tips

After the first two code blocks (separated by __syncthreads()), you have 64 values in each thread block (stored in that block's sPartials[]). So the code from if ( threadIdx.x < 32 ) onward accumulates those 64 values in each sPartials[]. It is there purely to optimize the speed of the reduction: the amount of data left in the remaining accumulation steps is small, so it is not worth shrinking the thread count and looping any further. You could just change the condition in the second code block

for ( int activeThreads = blockDim.x>>1;
              activeThreads > 32;
              activeThreads >>= 1 )

to

for ( int activeThreads = blockDim.x>>1;
                  activeThreads > 0;
                  activeThreads >>= 1 )

instead of

if ( threadIdx.x < 32 ) {
        volatile int *wsSum = sPartials;
        if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; 
        wsSum[tid] += wsSum[tid + 16]; 
        wsSum[tid] += wsSum[tid + 8];
        wsSum[tid] += wsSum[tid + 4];
        wsSum[tid] += wsSum[tid + 2];
        wsSum[tid] += wsSum[tid + 1];

for better understanding.
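
Putting that suggestion together, a sketch of the full simplified kernel (same structure as yours, just without the warp-synchronous tail; slower, but easier to reason about):

__global__ void
Reduction_simple_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();

    for ( int activeThreads = blockDim.x>>1;
              activeThreads > 0;
              activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
        __syncthreads();   // safe: every thread in the block reaches this call
    }
    if ( tid == 0 ) {
        out[blockIdx.x] = sPartials[0];   // one partial sum per block, as before
    }
}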

After the accumulation, only one value remains in each block's sPartials[], stored in sPartials[0] (in your code, wsSum[0]).

And after the kernel finishes, you can accumulate the per-block results (written to out[]) on the CPU to get the final result.
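
For example, a sketch of that last host-side step (the names h_out and numBlocks are illustrative; h_out[] is assumed to hold the per-block results copied back with cudaMemcpy):

long long sum_blocks_on_cpu( const int *h_out, int numBlocks )
{
    long long total = 0;
    for ( int b = 0; b < numBlocks; ++b )
        total += h_out[b];        // each block contributed exactly one value
    return total;
}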

The CUDA execution model in a nutshell: computations are divided between blocks on a grid. The blocks can have some shared resources (shared memory).

Each block is executed on a single Streaming Multiprocessor (SM), which is what makes the fast shared memory possible.

The work for each block is again split into warps of 32 threads. You can look at the work done by the warps as independent tasks. The SM switches between warps very quickly; for example, when a warp is waiting on a global memory access, the SM will switch to a different warp.

You know nothing about the order in which warps are executed. All you know is that after a call to __syncthreads(), all threads in the block will have run up to that point, and all their memory reads and writes will have completed.

The important thing to note is that all threads in a warp execute the same instruction; when there is a branch and different threads take different paths, some of them are paused.

So, in the reduction example, the first part may be executed by multiple warps. In the last part, there are only 32 threads left, so only one warp is active. The line

if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32];

is there to add the partial sums computed by the other warps to the partial sums of our final warp.

The next lines work as follows. Because execution within a warp is synchronized, it is safe to assume that the write operation to wsSum[tid] is completed before the next read, and so there is no need for a __syncthreads call.

The volatile keyword lets the compiler know that values in the wsSum array may be changed by other threads, so it will make sure that the value of wsSum[tid + X] isn't read before it has been updated by another thread in the previous instruction.

The last volatile declaration seems redundant: you could just as well use the existing wsSum variable.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow