Question

I'd really appreciate some help with this snippet. I've been working my way through writing my own reduction kernels, and I know such kernels already exist out there, so this is just for my own benefit. I made a max reduction that works, but when I modified it to calculate the sum I get varying results. cuda-memcheck gives me a 10-megabyte file full of errors and warnings, so I assume I have a race condition that I just don't see. Here's the code snippet.

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <iostream>
#include <fstream>
#include <iomanip>                      //For setprecision formatting
#include <math.h>
#include <ctime>                        //For timers

using namespace std;

#define NUMBLOCKSPERGRID 1                              //Universal number of blocks per grid to use.
#define NUMTHREADSPERBLOCK 256                          //Universal number of threads per block.
#define MAXLENGTH NUMTHREADSPERBLOCK*NUMBLOCKSPERGRID   //Use multiple of 256
#define NUMPOINTS 256                               //Number of timesteps to integrate.
double concStorage[NUMPOINTS][MAXLENGTH] = {};          //Stores concs [rows] vs. time [columns]

__device__ __constant__ int numThreads = NUMTHREADSPERBLOCK;    //Number of threads per block
__device__ __constant__ int numBlocks = NUMBLOCKSPERGRID;       //Number of blocks per grid
__device__ __constant__ int maxlength = MAXLENGTH;              //The largest polymer species we track.


//Temporary device arrays to use during integration.  Will not send back data.
__device__ double temp4[MAXLENGTH];

__global__ void rkf5(size_t, double*, double*, double*, double*);
__device__ void arrSum2(double*, double*);
__global__ void arrSumKernel(double*, double*, int);

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    //Error checking
    if (code != cudaSuccess) 
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
__global__ void arrInit(double* a, double b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx]=b;
}
__global__ void rkf5(size_t pitch, double* concStorage, double* concs, double* dt, double* d_ts)
{
    arrInit<<< numBlocks, numThreads >>>(temp4, 1);  //works
    cudaDeviceSynchronize();
    double p = 0;
    arrSum2(temp4, &p);
    cudaDeviceSynchronize();
    arrInit<<< numBlocks, numThreads >>>(d_ts, p);
    cudaDeviceSynchronize();
}

__device__ void arrSum2(double* arr, double* sumVal )
{
    /*
    Description : Sums all elements of array.
    Status : Untested.
    */

    int maxThreads = 1024;  //This can be reduced, but this is a max.  Make reducible for optimization.
    int blocks = int(maxlength/maxThreads)+1;   //works

    double* kernelSums= new double[blocks];
    double* blockSums= new double[1];

    arrInit<<< 1, blocks >>>(kernelSums, 0);    //works
    arrInit<<< 1, 1 >>>(blockSums, 0);  //works
    cudaDeviceSynchronize();

    arrSumKernel<<< blocks, maxThreads, maxThreads*sizeof(double) >>>(arr, kernelSums, maxlength);
    cudaDeviceSynchronize();
    arrSumKernel<<< 1, blocks, blocks*sizeof(double) >>>(kernelSums, blockSums, blocks);
    cudaDeviceSynchronize();
    *sumVal = blockSums[0];

}
__global__ void arrSumKernel(double* arr, double* sums, int length)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    extern __shared__ double blockMemory[];

    //This component will move the global array component to shared memory if its index < length, else pad with 0.
    if (idx < length)
        blockMemory[threadIdx.x] = arr[idx];
    else
        blockMemory[threadIdx.x] = 0;

    __syncthreads();

    int stage = 0;
    int maxStage = static_cast<int>(logf(blockDim.x)/logf(2));  //logf needed for CUDA

    while (stage <= maxStage)
    {
        int left = threadIdx.x;         
        int right = (threadIdx.x) + powf(2, (stage));       //idx+1 if idx is even, 0 if odd

        if (( right < blockDim.x ) && ( left % int(powf(2, stage)) == 0 ))
            blockMemory[left] += blockMemory[right];        //Move larger value left.

        stage++;
        __syncthreads();
    }

    sums[blockIdx.x] = blockMemory[0];
    __syncthreads();
}

int main(int argc, char** argv)
{
    //Main program.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    cudaSetDevice(0);
    std::cout << std::fixed;                //Use fixed-point notation.
    std::cout << std::setprecision(16);     //Display 16 decimal places.

    const int numpoints = NUMPOINTS;                    //Number of timesteps to take.
    const int maxlength = MAXLENGTH;                    //Number of discrete concentrations we are tracking.
    double mo = 5E-6;                                   //Initial concentration in M.
    double to = 0;                                      //Beginning integration time in seconds.
    double tf = 7200;                                   //Final integration time in seconds.
    double dt = (tf-to)/static_cast<double>(numpoints); //Static step size in seconds.

    double concs[maxlength] = {};               //Meant to store the initial concentrations .
    double ts[numpoints]= {};                   //Can hold all the times at which concentrations are stored.

    //Initialize all the arrays on the host to ensure arrays of 0's are sent to the device.
    //Also, here is where we can seed the system.
    std::cout<<dt;
    std::cout<<"\n";
    concs[0]=mo;
    std::cout<<concs[0];
    std::cout<<" ";

    concs[0]=mo;
    std::cout<<"\n";


    //Define all the pointers to device array memory addresses. These contain the on-GPU
    //addresses of all the data we're generating/using.
    double *d_concStorage;          //On GPU 2D array that stores all the concentrations and associated times.
    double *d_concs;                //The concentrations for a specific timestep.
    double *d_dt;                   //The length of the timestep.
    double *d_ts;

    //Calculate all the sizes of the arrays in order to allocate the proper amount of memory on the GPU.
    size_t size_concs = sizeof(concs);
    size_t size_dt = sizeof(dt);
    size_t size_ts = sizeof(ts);
    size_t h_pitch = maxlength*sizeof(double);
    size_t d_pitch;

    //Calculate the "pitch" of the 2D array.  The pitch is basically the length of a 2D array's row.  IT's larger 
    //than the actual row full of data due to hadware issues.  We thusly will use the pitch instead of the data 
    //size to traverse the array.
    gpuErrchk(cudaMallocPitch( (void**)&d_concStorage, &d_pitch, maxlength * sizeof(double), numpoints)); 

    //Allocate memory on the GPU for all the arrays we're going to use in the integrator.
    gpuErrchk(cudaMalloc((void**)&d_concs, size_concs));
    gpuErrchk(cudaMalloc((void**)&d_dt, size_dt));
    gpuErrchk(cudaMalloc((void**)&d_ts, size_ts));

    //Copy all initial values of arrays to GPU.
    gpuErrchk(cudaMemcpy2D(d_concStorage, d_pitch, concStorage, h_pitch, maxlength*sizeof(double), numpoints, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_concs, &concs, size_concs, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_dt, &dt, size_dt, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_ts, &ts, size_ts, cudaMemcpyHostToDevice));

    //Run the integrator.
    std::clock_t start;
    double duration;
    start = std::clock();
    rkf5<<<1,1>>>(d_pitch, d_concStorage, d_concs, d_dt, d_ts);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    duration = (std::clock() - start)/ (double) CLOCKS_PER_SEC;
    std::cout<<"printf: "<< duration <<'\n';

    //Copy 2D array of concentrations vs. time from GPU to Host.
    gpuErrchk( cudaMemcpy2D(concStorage, h_pitch, d_concStorage, d_pitch, maxlength*sizeof(double), numpoints, cudaMemcpyDeviceToHost) );   
    gpuErrchk( cudaMemcpy(&ts, d_ts, size_ts, cudaMemcpyDeviceToHost));
    cudaDeviceSynchronize();

    for (int i=0; i < maxlength; i++)
    {
        std::cout << " ";
        std::cout << ts[0];
    }



    cudaDeviceReset();  //Clean up all memory.
    return 0;
}

Resulting in a boatload of these:

========= CUDA-MEMCHECK
========= ERROR: Potential RAW hazard detected at __shared__ 0x1007 in block (0, 0, 0) :
=========     Write Thread (512, 0, 0) at 0x000030c8 in C:/Users/Karsten Chu/New Google Drive/Research/Visual Studio 2012/Projects/Dynamic Parallelism Test/Dynamic Parallelism Test/arrayFunctions.cu:209:arrSumKernel(double*, double*, int)
=========     Read Thread (0, 0, 0) at 0x00003038 in C:/Users/Karsten Chu/New Google Drive/Research/Visual Studio 2012/Projects/Dynamic Parallelism Test/Dynamic Parallelism Test/arrayFunctions.cu:209:arrSumKernel(double*, double*, int)
=========     Current Value : 0
=========
========= ERROR: Potential RAW hazard detected at __shared__ 0x1006 in block (0, 0, 0) :
=========     Write Thread (512, 0, 0) at 0x000030c8 in C:/Users/Karsten Chu/New Google Drive/Research/Visual Studio 2012/Projects/Dynamic Parallelism Test/Dynamic Parallelism Test/arrayFunctions.cu:209:arrSumKernel(double*, double*, int)
=========     Read Thread (0, 0, 0) at 0x00003038 in C:/Users/Karsten Chu/New Google Drive/Research/Visual Studio 2012/Projects/Dynamic Parallelism Test/Dynamic Parallelism Test/arrayFunctions.cu:209:arrSumKernel(double*, double*, int)
=========     Current Value : 0

Strangely, the result, which given these parameters should be "256", is correct if I run it through cuda-memcheck --tool racecheck. Results vary if I just run the app.


Solution

How am I creating a RAW hazard here?

Let's take a look at this loop:

int stage = 0;
int maxStage = static_cast<int>(logf(blockDim.x)/logf(2));  //logf needed for CUDA

while (stage <= maxStage)
{
    int left = threadIdx.x;         
    int right = (threadIdx.x) + powf(2, (stage));       //idx+1 if idx is even, 0 if odd

    if (( right < blockDim.x ) && ( left % int(powf(2, stage)) == 0 ))
        blockMemory[left] += blockMemory[right];        //Move larger value left.

    stage++;
    __syncthreads();
}

sums[blockIdx.x] = blockMemory[0];

On the very first pass through the while loop, stage = 0, left = threadIdx.x, and right = threadIdx.x + 1. All threads but the last one pass the first part of the conditional test, (right < blockDim.x), and every thread passes the second part, (left % int(powf(2, stage)) == 0), since any integer % 1 == 0.

Therefore, this line of code:

    blockMemory[left] += blockMemory[right];

gets executed for every thread in the block (except the last one). But that can't be correct, because now the results depend on the order of warp execution (if warp 1 executes before warp 0, the result for warp 0 thread 31 will be different than if warp 0 executes before warp 1), and this constitutes a race condition, or RAW hazard.
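
To make that concrete, here is an illustrative trace of the two threads on either side of the first warp boundary during the stage = 0 pass (not code from the kernel above, just the effect of that line):

// stage = 0: left % 1 == 0 holds for every thread, so nearly all of them add.
// thread 31 (warp 0): blockMemory[31] += blockMemory[32];   // reads blockMemory[32]
// thread 32 (warp 1): blockMemory[32] += blockMemory[33];   // writes blockMemory[32]
// Whether thread 31 sees the old or the already-updated blockMemory[32] depends
// on which warp happens to run first; that is exactly the RAW hazard racecheck reports.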

If I modify the conditional test as follows:

    if (( right < blockDim.x ) && ( left % int(powf(2, stage+1)) == 0 ))

Then I get consistently reproducible results (256).
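
(With the extra factor of two, only threads whose index is a multiple of 2^(stage+1) perform an add at a given stage, so no element is read by one thread while another thread is adding into it during the same pass.)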

I'm not saying the code is bug free at that point, but I think that should point you in the direction of fixing it. It's not how I would do a parallel reduction. Certainly the use of the transcendentals (powf, logf) should be avoided for performance, and this is not difficult to do for powers of 2.
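
For reference, here is a minimal sketch of the same block-level sum in the more conventional style, using a halving stride and bit shifts rather than powf/logf. It assumes blockDim.x is a power of two, and the kernel and parameter names are illustrative, not taken from the code above:

__global__ void blockSumKernel(const double* arr, double* partialSums, int length)
{
    extern __shared__ double sdata[];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (idx < length) ? arr[idx] : 0.0;   //Pad out-of-range elements with 0.
    __syncthreads();

    //Halve the number of active threads each stage; no transcendentals, no modulo test.
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    //Thread 0 of each block writes out that block's partial sum.
    if (threadIdx.x == 0)
        partialSums[blockIdx.x] = sdata[0];
}

Launched the same way as arrSumKernel (blocks per grid, threads per block, threads*sizeof(double) bytes of dynamic shared memory), two passes of this kernel reduce the array to a single value, just as arrSum2 does.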

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow