CUDA Algorithm Cascading

Question

It refers to the idea that multiple algorithms (methodologies) are used to sum all the elements together.

The first algorithm that is used is a parallel sequential sum, performed by many threads. Each thread accumulates a partial sum of a subset of elements in the overall array to be summed. This algorithm is sequential in nature (from the viewpoint of thread code) and is conceptually like a for-loop, or while-loop. Each thread is acting indvidually, and there is no explicit cooperation between threads. This algorithm (parallel sequential sums) is covered on slide 33.

The first algorithm cascades its results (partial sums accumulated across many threads) to a second algorithm, which accumulates groups of partial sums together, creating a relatively smaller number of partial sums. This algorithm operates within a threadblock, accumulating the partial sums of each thread together in a parallel sweeping-collapsing fashion (classical parallel reduction) which is pictorially indicated on slide 5 (the green bar). Each element at the top of the green bar in slide 5 represents one thread's partial sum accumulated in the previous (sequential) step.

A final level of reduction may be used also (blue bar on slide 5) which is conceptually similar to the second algorithm described above. The partial results from the previous step are then cascaded to this step, which does a final summation of partial results, producing a single, final result.

Using this cascade process, the first algorithm and the second algorithm look very different from each other pictorially, and at the code level, but combined together they produce a fast reduction for arbitrary sized arrays, and arbitrary sized grids (arrays of threads).