Question

I am testing some code in CUDA (I am new to CUDA and this is my first application). So far the results I get from CUDA match those obtained by running the code serially on the CPU. I am using Visual Studio 2010 with the Debug build configuration, but as soon as I switch the build configuration to "Release", I start getting wrong results. I have not been able to use the NVIDIA forums as they are down at the moment. Can someone with CUDA experience point out the issue? The code is as follows:

__global__ void MyKernel(int *Nptr, int *deltaptr, double *gravityptr,
                         double *separationptr, double *fconptr,
                         double *xForce, double *yForce, double *zForce,
                         double *xPos, double *yPos, double *zPos)
{
    int N = *Nptr;
    int delta = *deltaptr;
    double gravity = *gravityptr;
    double separation = *separationptr;
    double fcon = *fconptr;

    double len = 0.0;
    double r12X = 0.0;
    double r12Y = 0.0;
    double r12Z = 0.0;

    // One thread per (nx, ny) grid point
    int nx = blockDim.x * blockIdx.x + threadIdx.x;
    int ny = blockDim.y * blockIdx.y + threadIdx.y;

    if (!(nx < N && ny < N))
        return;

    xForce[nx*N+ny] = 0.0;
    yForce[nx*N+ny] = -gravity;
    zForce[nx*N+ny] = 0.0;

    // Accumulate forces from neighbors within a (2*delta+1)^2 window,
    // clamped to the grid boundaries
    int lowerValuedx = maxOnDevice(nx-delta, 0);
    int upperValuedx = minOnDevice(nx+delta+1, N);
    for (int dx = lowerValuedx; dx < upperValuedx; dx++)
    {
        int lowerValuedy = maxOnDevice(ny-delta, 0);
        int upperValuedy = minOnDevice(ny+delta+1, N);
        for (int dy = lowerValuedy; dy < upperValuedy; dy++)
        {
            len = sqrt((double)((nx-dx)*(nx-dx) + (ny-dy)*(ny-dy))) * separation;

            if (nx != dx || ny != dy)  // skip self-interaction
            {
                r12X = xPos[dx*N+dy] - xPos[nx*N+ny];
                r12Y = yPos[dx*N+dy] - yPos[nx*N+ny];
                r12Z = zPos[dx*N+dy] - zPos[nx*N+ny];
                xForce[nx*N+ny] += fcon*normxOnDevice(r12X,r12Y,r12Z)*(magOnDevice(r12X,r12Y,r12Z)-len);
                yForce[nx*N+ny] += fcon*normyOnDevice(r12X,r12Y,r12Z)*(magOnDevice(r12X,r12Y,r12Z)-len);
                zForce[nx*N+ny] += fcon*normzOnDevice(r12X,r12Y,r12Z)*(magOnDevice(r12X,r12Y,r12Z)-len);
            }
        }
    }
}

Thanks


Solution

It is not unusual for numerical differences to exist between CPU and GPU results, and between debug and release builds on the GPU. This does not mean either set of results is incorrect, but one may be more accurate than the other. See the following whitepaper from NVIDIA, which discusses various mechanisms that can lead to numerical discrepancies:

http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

You can check whether the nvcc flag -fmad=false eliminates the differences you are seeing; if it does, that would indicate the differences are due to FMA/FMAD merging and are thus likely harmless.
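For example (the source and output file names here are placeholders for your own), you can build the same code twice from the command line and diff the results; in a Visual Studio project the same flag would go into the CUDA build rule's additional compiler options:

nvcc -o app_fma app.cu
nvcc -fmad=false -o app_nofma app.cu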

GPUs provide FMAD and FMA (fused multiply-add) operations that combine a floating-point multiply with a dependent floating-point add into a single operation. This helps performance, because the combined operation typically takes about as long as either of its constituent operations alone. However, the rounding behavior of these combined operations differs from that of two individually rounded operations:

Single-precision FMAD (compute capability < 2.0) truncates the result of the multiply, then rounds the result of the final addition according to IEEE-754 round-to-nearest-or-even. By contrast, FMA (single precision on compute capability >= 2.0, and double precision) computes the unrounded, double-wide product, adds the third operand to it, and rounds the final sum according to IEEE-754 round-to-nearest-or-even. Because of this single rounding, the average accuracy provided by FMA is superior to that of two separately rounded operations. The FMA operation is specified in the 2008 version of the IEEE-754 floating-point standard.
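A minimal sketch of the effect, for illustration only: the value of a is chosen so that the exact product a*a has a low-order bit that two separately rounded operations discard. __dmul_rn and __dadd_rn are CUDA's explicitly rounded double-precision intrinsics, which the compiler never merges into an FMA, while fma() forces the fused operation.

#include <stdio.h>
#include <math.h>

// Compute a*a - 1.0 two ways, for an a where the rounding differs.
__global__ void fmaDemo(double a, double *out)
{
    // Separately rounded multiply then add: the tiny 2^-60 term of the
    // exact product is lost when a*a is rounded to double precision.
    out[0] = __dadd_rn(__dmul_rn(a, a), -1.0);
    // Fused multiply-add: the unrounded double-wide product is kept, so
    // only one rounding happens and the small term survives.
    out[1] = fma(a, a, -1.0);
}

int main()
{
    double a = 1.0 + ldexp(1.0, -30);  // a = 1 + 2^-30; exactly, a*a - 1 = 2^-29 + 2^-60
    double *d_out, h_out[2];
    cudaMalloc((void**)&d_out, 2 * sizeof(double));
    fmaDemo<<<1, 1>>>(a, d_out);
    cudaMemcpy(h_out, d_out, 2 * sizeof(double), cudaMemcpyDeviceToHost);
    printf("mul+add: %.17e\nfma    : %.17e\n", h_out[0], h_out[1]);  // last bits differ
    cudaFree(d_out);
    return 0;
}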

By default, for release builds, the CUDA compiler generates merged operations (FMAD, FMA) aggressively to achieve the best performance. In other words, the compiler default is -fmad=true, which allows the compiler to merge floating-point multiplies and adds. Specifying -fmad=false inhibits the merging of multiplies and adds, which typically provides greater consistency with CPU results, since most CPUs do not provide the FMA operation. Obviously, disabling the merged operations has a negative impact on performance, so -fmad=false is primarily useful as a sanity check.

Where accuracy issues are suspected, I generally recommend comparison with a higher-precision reference implementation (e.g. one based on quad precision or double-double techniques) to accurately assess the error on both CPU and GPU, rather than using the CPU version as a reference (as the CPU results are also affected by round-off error).
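As a sketch of that methodology (the helper name is mine, and how the reference is produced, whether by double-double arithmetic, quad precision, or a host library, is left open; note that on MSVC long double has the same width as double, so there you would substitute an actual higher-precision type):

#include <stdio.h>
#include <math.h>

// Given CPU results, GPU results, and a reference computed in higher
// precision, report the worst relative error of each. The point is to
// measure both against the reference rather than treating the CPU
// results as ground truth.
void compareToReference(const double *cpu, const double *gpu,
                        const long double *ref, int n)
{
    double maxErrCpu = 0.0, maxErrGpu = 0.0;
    for (int i = 0; i < n; i++) {
        long double r = ref[i];
        long double scale = (fabsl(r) > 0.0L) ? fabsl(r) : 1.0L;
        maxErrCpu = fmax(maxErrCpu, (double)(fabsl(cpu[i] - r) / scale));
        maxErrGpu = fmax(maxErrGpu, (double)(fabsl(gpu[i] - r) / scale));
    }
    printf("max rel. error vs reference: CPU %.3e, GPU %.3e\n",
           maxErrCpu, maxErrGpu);
}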

Licensed under: CC-BY-SA with attribution