Question

I am just starting out with CUDA and am trying to wrap my brain around the CUDA reduction algorithm. In my case, I have been trying to compute the dot product of two matrices (stored as flat arrays). I get the right answer only when the size is 2; for any other size, the result is wrong.

This is only a test, so I am keeping the matrix size very small, only about 100 elements, so that a single block can handle it all. Any help would be greatly appreciated. Thanks!

Here is the regular CPU code:

float* ha = new float[n]; // matrix a
float* hb = new float[n]; // matrix b
float* hc = new float[1]; // sum of a.b, copied back from the device

float dx = hc[0]; // device result (hc is filled by a cudaMemcpy in the complete code)
float hx = 0;     // reference result computed on the host
// dot product on the CPU, for comparison
for (int i = 0; i < n; i++)
    hx += ha[i] * hb[i];

Here is my CUDA kernel:

__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    dc[tid] = 0;
    for (int stride = 1; stride < n; stride *= 2) {
        if (tid % (2 * stride) == 0)
            dc[tid] += (da[tid] * db[tid]) + (da[tid+stride] * db[tid+stride]);
        __syncthreads();
    }
}

My complete code: http://pastebin.com/zS85URX5


The solution

Hopefully you can figure out why it works for the n=2 case, so let's skip that and look at why it fails for some other case; let's choose n=4. When n = 4, you have 4 threads, numbered 0 to 3.

In the first iteration of your for-loop, stride = 1, so the threads that pass the if test are threads 0 and 2.

thread 0:   dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2:   dc[2] += da[2]*db[2] + da[3]*db[3];

So far so good. In the second iteration of your for-loop, stride is 2, so the only thread that passes the if test is thread 0.

thread 0:   dc[0] += da[0]*db[0] + da[2]*db[2]; 

But this doesn't make sense and is not what we want at all: it adds da[0]*db[0] a second time, and it picks up the raw product da[2]*db[2] instead of the partial sum already sitting in dc[2], so da[3]*db[3] never makes it into the total. What we want at this step is something like:

dc[0] += dc[2];
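
For comparison, writing p[i] as shorthand for da[i]*db[i], a correct two-step tree reduction for n=4 would proceed like this (a hand trace of the desired behavior, not code from the question):

step 1 (stride 1):  dc[0] = p[0] + p[1];    dc[2] = p[2] + p[3];
step 2 (stride 2):  dc[0] += dc[2];         // dc[0] = p[0] + p[1] + p[2] + p[3]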

So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense as a reduction. If you replace your kernel code with the code below, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (i.e., n < max thread block size, using a single block):

// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    // do the multiplication in parallel across the full width of threads
    dc[tid] = da[tid] * db[tid];
    // wait for all threads to complete the multiply step
    __syncthreads();
    int stride = blockDim.x;
    while (stride > 1) {
        // handle an odd stride: fold the unpaired last element into dc[0]
        if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
        // successively divide the problem by 2
        stride >>= 1;
        // add each upper-half element to its lower-half counterpart
        if (tid < stride) dc[tid] += dc[tid + stride];
        // wait for all threads to complete the add step
        __syncthreads();
    }
}
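
The (stride & 1) test is what keeps odd widths correct: before an odd stride is halved, thread 0 folds the unpaired last element into dc[0]. Here is a hand trace of the kernel above for n = 5, writing p[i] for da[i]*db[i], so dc starts as [p0, p1, p2, p3, p4] after the multiply step:

stride = 5 (odd):  dc[0] += dc[4];                   // dc[0] = p[0] + p[4]
stride -> 2:       dc[0] += dc[2];  dc[1] += dc[3];  // threads 0 and 1
stride -> 1:       dc[0] += dc[1];                   // dc[0] now holds all five products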

Note that I'm not really using the n parameter. Since you are launching the kernel with n threads in a single block, the blockDim.x built-in variable is equal to n in this case.
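
In case it helps, here is a minimal sketch of the host-side setup and launch. This is my own illustrative code, not taken from your pastebin; the fill values are arbitrary, and it assumes the fixed sum_reduce kernel above is defined earlier in the same file:

#include <cstdio>
#include <cuda_runtime.h>

// assumes the fixed sum_reduce kernel above is defined earlier in this file

int main()
{
    const int n = 100;                   // n < max threads per block, single block
    float ha[n], hb[n], result;
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }  // sample data

    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));  // scratch space: one slot per thread
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    sum_reduce<<<1, n>>>(da, db, dc, n); // one block of n threads, so blockDim.x == n

    cudaMemcpy(&result, dc, sizeof(float), cudaMemcpyDeviceToHost);  // dc[0] holds the sum
    printf("dot product = %f\n", result);  // expect 200.0 for this sample data

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}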

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow