CUDA: How can I run the parallel reduction code for summation that is described in the NVIDIA paper by Mark Harris?

StackOverflow https://stackoverflow.com/questions/18023287


Question

Although I understand the logic behind the parallel reduction described in this paper, I can't seem to be able to run it for a simple example where the input array is filled with 1s.

Here is what I achieved so far. Keep in mind that I'm using the thrust library to manage input and output data.

#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <ctime>
#include <sys/time.h>
#include <sstream>
#include <string>
#include <fstream>

using namespace std;


__global__ void reduce0(int *g_idata, int *g_odata){

   extern __shared__ int sdata[];

  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
  sdata[tid] = g_idata[i];

  __syncthreads();

  for(unsigned int s=1; s < blockDim.x; s *= 2) {
     if (tid % (2*s) == 0) {
        sdata[tid] += sdata[tid + s];
     }
     __syncthreads();
  }
 if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}


int main(void){

  int size = 10;
  thrust::host_vector<int> data_h_i(size, 1);

  //initialize the data, all values will be 1 
  //so the final sum will be equal to 10

  int threadsPerBlock = 256;
  int totalBlocks = size/threadsPerBlock + 1;

  dim3 dimGrid(totalBlocks,1,1);
  dim3 dimBlock(threadsPerBlock, 1, 1);

  thrust::device_vector<int> data_v_i = data_h_i;
  thrust::device_vector<int> data_v_o(size);

  int* output = thrust::raw_pointer_cast(data_v_o.data());
  int* input = thrust::raw_pointer_cast(data_v_i.data());

  reduce0<<<dimGrid, dimBlock>>>(input, output);

  data_v_i.clear();
  data_v_i.shrink_to_fit();

  thrust::host_vector<int> data_h_o = data_v_o;

  data_v_o.clear();
  data_v_o.shrink_to_fit();

  cout<<data_h_o[0]<<endl;


  return 0;

}

The code is simple: I create a host_vector of size size and initialize all of its values to 1.

Then I specify 256 threads per block and compute dynamically the number of blocks needed for my example.

To keep things simple, I create an array of only 10 values, which means we will need just one block, so a single kernel invocation will be enough to produce the final result.

My questions are the following:

Question 1

After compiling the above example (nvcc -O3 reduction.cu -arch=sm_21) and running ./a.out, I get the following message:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  unspecified launch failure

I'm not sure what's going on here, but it seems to me that the error comes from the line

sdata[tid] = g_idata[i]

The kernel is an exact copy of the kernel described in the paper, so I'm not sure what changes are required to fix this problem.

Question 2

If we fix the first problem, how could we make the above code work for an input array of arbitrary size? If, for example, our size is more than 256, then we would need at least two blocks, so each block would produce an output that would then have to be combined with the outputs of the other blocks. The paper says that we would need multiple invocations of the kernel; however, I'm not sure how this can be done dynamically.

Thank you in advance

EDIT1: For Question 1, it seems that I wasn't allocating the shared memory correctly. Calling the kernel like this: reduce0<<<dimGrid, dimBlock, size*sizeof(int)>>>(input, output); and also checking that tid is not out of range makes the code work properly. The new kernel is the following:

__global__ void reduce0(int *g_idata, int *g_odata, int size){

   extern __shared__ int sdata[];

   unsigned int tid = threadIdx.x;
   unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

   if(tid<size){

     sdata[tid] = g_idata[i];
     __syncthreads();

    for(unsigned int s=1; s < size; s *= 2) {
        if (tid % (2*s) == 0) {
         sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
     }

   if (tid == 0) g_odata[blockIdx.x] = sdata[0];

  }

}

I'm still not sure about Question 2 though.


Solution

Question 1

Your kernel is using dynamically allocated shared memory:

extern __shared__ int sdata[];
...
sdata[tid] = g_idata[i];

But you are not allocating any dynamic shared memory in your kernel call:

reduce0<<<dimGrid, dimBlock>>>(input, output);
                           ^
                           |
                           missing shared memory parameter.

So when you attempt to access the shared memory, you get a kernel fault. By the way, you can still do CUDA error checking on your kernel calls (even though you are using thrust elsewhere).
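
For reference, a minimal sketch (not part of the original answer) of the corrected launch plus basic error checking, using the dimGrid, dimBlock and threadsPerBlock variables already defined in the question's main():

// supply the dynamic shared memory size (one int per thread) as the
// third launch configuration parameter
reduce0<<<dimGrid, dimBlock, threadsPerBlock*sizeof(int)>>>(input, output);

// check for launch/configuration errors ...
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    cout << "launch error: " << cudaGetErrorString(err) << endl;

// ... and for errors raised while the kernel was running
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    cout << "execution error: " << cudaGetErrorString(err) << endl;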

Question 2

Question 2 is pretty well answered in Mark's paper here. You can see at the bottom of slide 9 that each block writes its partial result to an array in global memory (g_odata[]), which stores one result per block. We then launch another kernel of essentially the same type that operates on g_odata[] instead of the original input data. We can repeat this process until our partial results (e.g. g_odata[]) contain no more than 256 results, or however many threads we are launching per threadblock. We can then sum that final set of partial results with a single threadblock and produce a single answer value. For example, with 40000 inputs and 256 threads per block, the first pass produces 157 partial sums, which a single 256-thread block can then reduce to the final answer.

Examples are given in the CUDA sample code here.

Here's an edited version of your code that shows how to call the two kernels in sequence to handle a larger size. I don't consider this a paragon of reduction programming, just a simple extension of what you already wrote to illustrate the concept. Note that there are a variety of changes throughout the kernel and main code to facilitate handling larger data sizes. This method still won't scale beyond a data size of threadsPerBlock^2 (65536 elements for 256 threads per block), but again, it's just meant to illustrate the concept of calling multiple kernels in sequence to sum partial results, with the fewest modifications to your code.

#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <ctime>
#include <sys/time.h>
#include <sstream>
#include <string>
#include <fstream>

using namespace std;


__global__ void reduce0(int *g_idata, int *g_odata, int size){

   extern __shared__ int sdata[];

   unsigned int tid = threadIdx.x;
   unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
   sdata[tid] = 0;
   if(i<size)
     sdata[tid] = g_idata[i];
   __syncthreads();

  for(unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
         sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
     }

   if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

int main(void){

  int size = 40000;
  thrust::host_vector<int> data_h_i(size, 1);

  //initialize the data, all values will be 1
  //so the final sum will be equal to size

  int threadsPerBlock = 256;
  int totalBlocks = (size+(threadsPerBlock-1))/threadsPerBlock;

  thrust::device_vector<int> data_v_i = data_h_i;
  thrust::device_vector<int> data_v_o(totalBlocks);

  int* output = thrust::raw_pointer_cast(data_v_o.data());
  int* input = thrust::raw_pointer_cast(data_v_i.data());
  // first pass: each block reduces up to threadsPerBlock inputs to one partial sum
  reduce0<<<totalBlocks, threadsPerBlock, threadsPerBlock*sizeof(int)>>>(input, output, size);

  // second pass: a single block reduces the totalBlocks partial sums to the final result
  reduce0<<<1, threadsPerBlock, threadsPerBlock*sizeof(int)>>>(output, input, totalBlocks);
  // the final sum is now in input[0]; copy it into the output vector for the host copy below
  data_v_o[0] = data_v_i[0];
  data_v_i.clear();
  data_v_i.shrink_to_fit();

  thrust::host_vector<int> data_h_o = data_v_o;

  data_v_o.clear();
  data_v_o.shrink_to_fit();

  cout<<data_h_o[0]<<endl;


  return 0;

}

Other tips

After modifying the code provided by Robert Crovella to answer my question, here is the final version, which supports an arbitrary number of input values.

#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <ctime>
#include <sys/time.h>
#include <sstream>
#include <string>
#include <fstream>

using namespace std;


__global__ void reduce0(int *g_idata, int *g_odata, int size){

   extern __shared__ int sdata[];

   unsigned int tid = threadIdx.x;
   unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
   sdata[tid] = 0;
   if(i<size)
     sdata[tid] = g_idata[i];
   __syncthreads();

   for(unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
         sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
     }

   if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

int main(void){

  int size = 939289;
  thrust::host_vector<int> data_h_i(size, 1);

  //initialize the data, all values will be 1
  //so the final sum will be equal to size

  int threadsPerBlock = 256;
  int totalBlocks = (size+(threadsPerBlock-1))/threadsPerBlock;

  thrust::device_vector<int> data_v_i = data_h_i;
  thrust::device_vector<int> data_v_o(totalBlocks);

  int* output = thrust::raw_pointer_cast(data_v_o.data());
  int* input = thrust::raw_pointer_cast(data_v_i.data());

  // ping-pong between the two device buffers: each pass reduces "size" values
  // into one partial sum per block, until only a single block remains
  bool turn = true;

  while(true){

    if(turn){

      reduce0<<<totalBlocks, threadsPerBlock, threadsPerBlock*sizeof(int)>>>(input, output, size);
      turn = false;

    }
    else{

      reduce0<<<totalBlocks, threadsPerBlock, threadsPerBlock*sizeof(int)>>>(output, input, size);
      turn = true;

    }

    if(totalBlocks == 1) break;

    // the partial sums from this pass become the input of the next pass
    size = totalBlocks;
    totalBlocks = ceil((double)totalBlocks/threadsPerBlock);

  }

  thrust::host_vector<int> data_h_o;

  if(turn)
    data_h_o = data_v_i;
  else
    data_h_o = data_v_o;

  data_v_i.clear();
  data_v_i.shrink_to_fit();

  data_v_o.clear();
  data_v_o.shrink_to_fit();

  cout<<data_h_o[0]<<endl;


  return 0;

}
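
A side note (my addition, not part of the original answers): since thrust is already managing the device vectors, the result of the custom kernel can be cross-checked against thrust's built-in reduction, as long as the check runs on data_v_i before the ping-pong passes overwrite it. A minimal sketch:

#include <thrust/reduce.h>
...
// reference sum computed by thrust directly on the device vector
int reference = thrust::reduce(data_v_i.begin(), data_v_i.end());
cout << "thrust::reduce says: " << reference << endl;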
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow