Trying to understand CUDA SDK final reduction snippet in complexity and execution terms

StackOverflow https://stackoverflow.com/questions/20525716


Question

I am trying to understand parallel reduction in CUDA (it is very interesting). In my last question about parallel reduction, Robert Crovella gave a very intuitive and detailed explanation that helped me a lot. Now, looking at the reduction samples of the CUDA SDK, there are a few dark spots.

  1. Why (in the comment below) is the work complexity kept at O(n)? In which case would the opposite happen? I have the same question about the step complexity.

    This version adds multiple elements per thread sequentially. This reduces the overall cost of the algorithm while keeping the work complexity O(n) and the step complexity
    O(log n).

  2. Can someone give an intuitive example for the comment regarding the amount of shared memory ("Note this kernel needs... allocate blockSize*sizeof(T) bytes") and explain how it relates to the code?

  3. Why is nIsPow2 so important? Can someone explain it or give an example?

  4. Why do we use mySum, for example, in the following assignments: sdata[tid] = mySum = mySum + sdata[tid + 256]; and not just sdata[tid] += sdata[tid + 256]; as in Mark Harris's presentation?

    /* This version adds multiple elements per thread sequentially.  This reduces
       the overall cost of the algorithm while keeping the work complexity O(n)
       and the step complexity O(log n).
       (Brent's Theorem optimization)

       Note, this kernel needs a minimum of 64*sizeof(T) bytes of shared memory.
       In other words if blockSize <= 32, allocate 64*sizeof(T) bytes.
       If blockSize > 32, allocate blockSize*sizeof(T) bytes. */
    template <class T, unsigned int blockSize, bool nIsPow2>
    __global__ void reduce6(T *g_idata, T *g_odata, unsigned int n)
    {
        T *sdata = SharedMemory<T>();

        // perform first level of reduction,
        // reading from global memory, writing to shared memory
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
        unsigned int gridSize = blockSize*2*gridDim.x;

        T mySum = 0;

        // we reduce multiple elements per thread.  The number is determined by the
        // number of active thread blocks (via gridDim).  More blocks will result
        // in a larger gridSize and therefore fewer elements per thread
        while (i < n)
        {
            mySum += g_idata[i];

            // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
            if (nIsPow2 || i + blockSize < n)
                mySum += g_idata[i+blockSize];

            i += gridSize;
        }

        // each thread puts its local sum into shared memory
        sdata[tid] = mySum;
        __syncthreads();

        // do reduction in shared mem
        if (blockSize >= 512)
        {
            if (tid < 256)
            {
                sdata[tid] = mySum = mySum + sdata[tid + 256];
            }

            __syncthreads();
        }

        if (blockSize >= 256)
        {
            if (tid < 128)
            {
                sdata[tid] = mySum = mySum + sdata[tid + 128];
            }

            __syncthreads();
        }

        if (blockSize >= 128)
        {
            if (tid <  64)
            {
                sdata[tid] = mySum = mySum + sdata[tid +  64];
            }

            __syncthreads();
        }

        if (tid < 32)
        {
            // now that we are using warp-synchronous programming (below)
            // we need to declare our shared memory volatile so that the compiler
            // doesn't reorder stores to it and induce incorrect behavior.
            volatile T *smem = sdata;

            if (blockSize >=  64)
            {
                smem[tid] = mySum = mySum + smem[tid + 32];
            }

            if (blockSize >=  32)
            {
                smem[tid] = mySum = mySum + smem[tid + 16];
            }

            if (blockSize >=  16)
            {
                smem[tid] = mySum = mySum + smem[tid +  8];
            }

            if (blockSize >=   8)
            {
                smem[tid] = mySum = mySum + smem[tid +  4];
            }

            if (blockSize >=   4)
            {
                smem[tid] = mySum = mySum + smem[tid +  2];
            }

            if (blockSize >=   2)
            {
                smem[tid] = mySum = mySum + smem[tid +  1];
            }
        }

        // write result for this block to global mem
        if (tid == 0)
            g_odata[blockIdx.x] = sdata[0];
    }


Solution

Sub-Question 1

reduce6 supposedly adds more work: each thread sums up several elements. However, the total number of threads decreases by the same factor, so the total amount of work remains O(n). As for the 'step complexity', that should be the number of kernel invocations. It is indeed O(log(n)), as the number of kernel invocations has dropped (each invocation reduces the data vector by a larger factor). I think it is still Theta(log(n)), though, if I remember correctly how the grid dimensions are set.
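
To see the accounting concretely, here is a small host-side sketch (the numbers are example values, not from the SDK): however many blocks we launch, the grid-stride loop in reduce6 only changes how many iterations each thread runs, and the total number of additions stays at n.

    #include <cstdio>

    int main()
    {
        unsigned int n = 1 << 20;     // example input size
        unsigned int blockSize = 256; // threads per block (example value)

        const unsigned int blockCounts[] = {64, 256, 1024};

        for (unsigned int numBlocks : blockCounts)
        {
            // stride used by reduce6's while-loop
            unsigned int gridSize = blockSize * 2 * numBlocks;
            // iterations of the while-loop per thread
            unsigned int perThread = (n + gridSize - 1) / gridSize;
            // each iteration adds (up to) 2 elements
            unsigned long long totalAdds =
                2ULL * blockSize * numBlocks * perThread;

            printf("blocks=%4u  iterations/thread=%2u  total adds=%llu (n=%u)\n",
                   numBlocks, perThread, totalAdds, n);
        }
        return 0;
    }

For all three block counts the printed total is 1048576 = n: more blocks means fewer loop iterations per thread, never more work overall.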

Sub-Question 2

Well, the kernel assumes each warp is 'full' in the sense that there is enough input data for all of its threads. A warp is 32 threads, and each thread reads at least 2 input elements, so the kernel assumes there are at least 64 input elements (whose total size is 64*sizeof(T) bytes). This also applies to shared memory, since the single-warp part of the kernel takes its input from shared memory, reading 32*2 elements of sizeof(T) bytes each from it.
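
In launch terms, the kernel comment translates into a small host-side helper along these lines (a sketch in the spirit of the SDK's host code; the helper name is mine):

    #include <cstddef>

    // Shared-memory sizing implied by the kernel comment (the helper name is
    // illustrative, not from the SDK). The tid < 32 section indexes up to
    // smem[tid + 32], so even a block of 32 or fewer threads must be backed
    // by 64 shared-memory slots.
    template <class T>
    size_t reduce6SharedBytes(unsigned int blockSize)
    {
        return (blockSize <= 32) ? 64 * sizeof(T)         // pad up to the 64-element minimum
                                 : blockSize * sizeof(T); // one slot per thread
    }

    // usage sketch:
    //   reduce6<float, 256, true><<<blocks, 256, reduce6SharedBytes<float>(256)>>>(d_in, d_out, n);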

Sub-Question 3

When nIsPow2 is set to true (and note this is not a parameter the kernel receives at run time, but rather a different version of the code, compiled separately from the nIsPow2 == false version), the condition nIsPow2 || i + blockSize < n always holds, and the compiler (assuming it is smart enough) will avoid the bounds check altogether. Thus every thread will (safely) perform mySum += g_idata[i+blockSize].
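
For example, the host side might dispatch between the two instantiations like this (a sketch that assumes the reduce6 kernel above is in scope; the SDK additionally switches over all block sizes, which is omitted here):

    #include <cuda_runtime.h>
    #include <cstddef>

    inline bool isPow2(unsigned int x)
    {
        return x && ((x & (x - 1)) == 0);
    }

    // Because nIsPow2 is a template parameter, each branch below selects a
    // separately compiled kernel; in the 'true' instantiation the condition
    // "nIsPow2 || i + blockSize < n" folds to a constant and the bounds
    // check disappears entirely.
    template <class T>
    void launchReduce6(T *d_idata, T *d_odata, unsigned int n,
                       unsigned int blocks, size_t smemBytes)
    {
        if (isPow2(n))
            reduce6<T, 256, true><<<blocks, 256, smemBytes>>>(d_idata, d_odata, n);
        else
            reduce6<T, 256, false><<<blocks, 256, smemBytes>>>(d_idata, d_odata, n);
    }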

Sub-Question 4

mySum is each thread's running partial sum. After we finish calculating it, some thread from each warp needs to share it with the other warps, and this is done via shared memory (which all warps in the block can access). Now, note that the 'double assignments', both to mySum and to the shared memory location, are due to further use by different threads: each thread calculates its own mySum value by repeatedly adding to it. After the final assignment of mySum to a shared memory location, the thread exits, and a lower-index thread uses that assignment. But all threads execute the same code, so they perform 'unnecessary' assignments from mySum to shared memory.
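
A minimal sketch (hypothetical device functions, not SDK code) makes the cost difference visible: the mySum form performs one shared-memory load per step, the plain compound-assignment form two:

    template <class T>
    __device__ void stepWithRegister(volatile T *smem, unsigned int tid,
                                     T &mySum, unsigned int offset)
    {
        // one shared-memory load: the running total lives in the register
        // mySum, and the store keeps smem[tid] visible to lower-index threads
        smem[tid] = mySum = mySum + smem[tid + offset];
    }

    template <class T>
    __device__ void stepWithoutRegister(volatile T *smem, unsigned int tid,
                                        unsigned int offset)
    {
        // two shared-memory loads: smem[tid] must be re-read on every step
        smem[tid] += smem[tid + offset];
    }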

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow