Question

I'm trying to implement the classic dot-product kernel for double precision arrays, with atomic computation of the final sum across the various blocks. I used the atomicAdd for double precision as stated on page 116 of the programming guide. Probably I'm doing something wrong. The partial sums across the threads in every block are computed correctly, but afterwards the atomic operation doesn't seem to be working properly, since every time I run my kernel with the same data, I receive different results. I'll be grateful if somebody could spot the mistake or provide an alternative solution! Here is my kernel:

__global__ void cuda_dot_kernel(int *n,double *a, double *b, double *dot_res)
{
    __shared__ double cache[threadsPerBlock]; //thread shared memory
    int global_tid=threadIdx.x + blockIdx.x * blockDim.x;
    int i=0,cacheIndex=0;
    double temp = 0;
    cacheIndex = threadIdx.x;
    while (global_tid < (*n)) {
        temp += a[global_tid] * b[global_tid];
        global_tid += blockDim.x * gridDim.x;
    }
    cache[cacheIndex] = temp;
    __syncthreads();
    for (i=blockDim.x/2; i>0; i>>=1) {
        if (threadIdx.x < i) {
            cache[threadIdx.x] += cache[threadIdx.x + i];
        }
        __syncthreads();
    }
    __syncthreads();
    if (cacheIndex==0) {
        *dot_res=cuda_atomicAdd(dot_res,cache[0]);
    }
}

And here is my device function atomicAdd:

__device__ double cuda_atomicAdd(double *address, double val)
{
    double assumed,old=*address;
    do {
        assumed=old;
        old= __longlong_as_double(atomicCAS((unsigned long long int*)address,
                    __double_as_longlong(assumed),
                    __double_as_longlong(val+assumed)));
    }while (assumed!=old);

    return old;
}

Solution

You are using the cuda_atomicAdd function incorrectly. This section of your kernel:

if (cacheIndex==0) {
    *dot_res=cuda_atomicAdd(dot_res,cache[0]);
}

is the culprit. Here, you atomically add to dot_res, and then non-atomically overwrite dot_res with the value the function returns. The return value of such a function is the previous value of the location being atomically updated; it is supplied for "information" or local use by the caller only. You shouldn't assign it back to the location you are atomically updating; that completely defeats the purpose of using atomic memory access in the first place. Do something like this instead:

if (cacheIndex==0) {
    double result=cuda_atomicAdd(dot_res,cache[0]);
}
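
Since the return value isn't used here anyway, the call can also be made purely for its side effect, e.g. (a minimal variant of the same block, keeping your cuda_atomicAdd):

if (cacheIndex==0) {
    cuda_atomicAdd(dot_res,cache[0]); // accumulate this block's partial sum; the old value is not needed
}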

OTHER TIPS

Getting a reduction right using ad hoc CUDA code can be tricky, so here's an alternative solution using a Thrust algorithm, which is included with the CUDA Toolkit:

#include <thrust/inner_product.h>
#include <thrust/device_ptr.h>

double do_dot_product(int n, double *a, double *b)
{
  // wrap raw pointers to device memory with device_ptr
  thrust::device_ptr<double> d_a(a), d_b(b);

  // inner_product implements a mathematical dot product
  return thrust::inner_product(d_a, d_a + n, d_b, 0.0);
}
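
For illustration, it could be called from host code roughly like this; this is just a sketch, and the a_d/b_d names are my own (any device pointers of length n will do):

// hypothetical host-side usage: a_d and b_d are device pointers of length n
int n = 1 << 20;
double *a_d, *b_d;
cudaMalloc(&a_d, n * sizeof(double));
cudaMalloc(&b_d, n * sizeof(double));
// ... copy or generate the input data on the device ...
double dot = do_dot_product(n, a_d, b_d);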

I didn't check your code in that much depth, but here is some advice.
I would only advise using Thrust if you use your GPU solely for such generic tasks; once a more complex problem arises, you will have no idea how to program the GPU efficiently in parallel.

  1. Start a separate parallel reduction kernel to sum up the per-block dot products (a sketch of such a second pass follows after the example below).
    Since the data is already on the device, you won't see a performance penalty from launching a new kernel.

  2. Your kernel doesn't seem to scale across the maximum possible number of blocks on the newest GPUs. If it did, and it had to compute the dot product of millions of values, performance would drop dramatically because of the serialized atomic operation.

  3. Beginner mistake: are your input data and shared memory accesses range checked? Or are you sure the input data is always a multiple of your block size? Otherwise you will read garbage. Most of my wrong results were due to this fault.

  4. Optimise your parallel reduction: see my thesis or Mark Harris's optimisation slides.

Untested; I just wrote it down in Notepad:

/*
 * @param inCount_s unsigned long long int Length of both input arrays
 * @param inValuesA_g double* First value array
 * @param inValuesB_g double* Second value array
 * @param outDots_g double* Output dot products of each block, length equals the number of blocks
 */
__global__ void dotProduct(const unsigned long long int inCount_s,
    const double* inValuesA_g,
    const double* inValuesB_g,
    double* outDots_g)
{
    //get unique block index in a possible 3D Grid
    const unsigned long long int blockId = blockIdx.x //1D
            + blockIdx.y * gridDim.x //2D
            + gridDim.x * gridDim.y * blockIdx.z; //3D


    //block dimension uses only x-coordinate
    const unsigned long long int tId = blockId * blockDim.x + threadIdx.x;

    /*
     * shared value pair products array, where BLOCK_SIZE power of 2
     *
     * To improve performance increase its size by multiple of BLOCK_SIZE, so that each threads loads more then 1 element!
     * (outDots_g length decreases by same factor, and you need to range check and initialize memory)
     * -> see harris gpu optimisations / parallel reduction slides for more informations.
     */
    __shared__ double dots_s[BLOCK_SIZE];


    /*
     * initialize shared memory array and calculate dot product of two values, 
     * shared memory always needs to be initialized, its never 0 by default, else garbage is read later!
     */
    if(tId < inCount_s)
        dots_s[threadIdx.x] = inValuesA_g[tId] * inValuesB_g[tId];
    else
        dots_s[threadIdx.x] = 0;
    __syncthreads();

    //do parallel reduction on the shared memory array to sum up the values
    //(simple in-place tree reduction; see my thesis / Harris's slides for optimised variants)
    for (unsigned int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride)
            dots_s[threadIdx.x] += dots_s[threadIdx.x + stride];
        __syncthreads();
    }

    //each block writes its partial dot product to its own slot
    if(threadIdx.x == 0)
        outDots_g[blockId] = dots_s[0];

    //start new parallel reduction kernel to sum up outDots_g!
}
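
The second pass mentioned above could look roughly like the following. This is only a sketch under the assumption that outDots_g holds one partial sum per block of the first pass; the kernel name sumBlocks, the result pointer, and the single-block launch are my own illustration:

/*
 * Sums the per-block results of dotProduct into a single value.
 * Launch with a single block of BLOCK_SIZE threads, e.g.:
 *   sumBlocks<<<1, BLOCK_SIZE>>>(numBlocks, outDots_g, result_g);
 */
__global__ void sumBlocks(const unsigned long long int inCount_s,
    const double* inDots_g,
    double* outResult_g)
{
    __shared__ double sums_s[BLOCK_SIZE];

    //each thread accumulates a strided slice of the per-block results
    double temp = 0.0;
    for (unsigned long long int i = threadIdx.x; i < inCount_s; i += blockDim.x)
        temp += inDots_g[i];
    sums_s[threadIdx.x] = temp;
    __syncthreads();

    //same in-place tree reduction as in the first pass
    for (unsigned int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride)
            sums_s[threadIdx.x] += sums_s[threadIdx.x + stride];
        __syncthreads();
    }

    //thread 0 writes the final dot product
    if (threadIdx.x == 0)
        *outResult_g = sums_s[0];
}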

Edit: removed unnecessary points.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow