Question

I am pasting some code here for everyone to see.

__global__ void Integrate(double a, double b) {
    // Romberg extrapolation table and per-thread trapezoid partial sums
    __shared__ double extrapol[16];
    __shared__ double result[32];
    __shared__ double h;    // current step size
    __shared__ double err;  // current error estimate

    __shared__ double x;    // first new sample point of the current level
    __shared__ int n;       // number of new sample points at this level

    if (threadIdx.x == 0) {
        h   = b - a;
        err = 1.0;

        // Initial trapezoid estimate; skip the left endpoint when a == 0
        if (0.0 == a)
            extrapol[0] = 0.5 * h * myfunc(b);
        else
            extrapol[0] = 0.5 * h * (myfunc(a) + myfunc(b));

        n = 1;
    }

    for (int i = 1; i < 16; i++) {
        if (threadIdx.x == 0)
            x = a + h * 0.5;

        __syncthreads();

        if (err <= EPSILON)
            break;

        // All 32 threads cooperate to sum the new sample points
        Trapezoid(result, x, h, n);
        if (threadIdx.x == 0) {
            result[0] = (extrapol[0] + h * result[0]) * 0.5;

            // Richardson extrapolation across the table
            double power = 1.0;
            for (int k = 0; k < i; k++) {
                power *= 4.0;
                double sum  = (power * result[0] - extrapol[k]) / (power - 1.0);
                extrapol[k] = result[0];
                result[0]   = sum;
            }

            err = fabs(result[0] - extrapol[i - 1]);
            extrapol[i] = result[0];
            n *= 2;
            h *= 0.5;
        }
    }
}

Essentially it is an adaptive numerical integrator (Romberg). The device functions used in this global function are:

__device__ void Trapezoid(double *sdata, double x, double h, int n) {
    int nIdx = threadIdx.x + blockIdx.x * blockDim.x;
    sdata[threadIdx.x] = 0;

    // Each thread accumulates every 32nd sample (the launch uses one block of 32)
    while (nIdx < n) {
        sdata[threadIdx.x] += myfunc(x + (nIdx * h));
        nIdx += 32;
    }
    Sum(sdata, threadIdx.x);
}

Parallel reduction function:

__device__ void Sum(volatile double *sdata, int tId) {
    // Warp-synchronous reduction: with a single warp of 32 threads no
    // __syncthreads() is needed, and volatile keeps the compiler from
    // caching sdata in registers between steps.
    if (tId < 16) {
        sdata[tId] += sdata[tId + 16];
        sdata[tId] += sdata[tId + 8];
        sdata[tId] += sdata[tId + 4];
        sdata[tId] += sdata[tId + 2];
        sdata[tId] += sdata[tId + 1];
    }
}

And finally, the function I am trying to integrate (a simple mock-up) is:

__device__ double myfunc(double x) {
    return 1.0 / x;
}

The code executes well and the expected integral is obtained. The kernel is executed in the following manner (for now):

Integrate <<< 1, 32 >>>(1, 2);
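
(For reference, the exact value is the integral of 1/x from 1 to 2, i.e. ln 2 ≈ 0.693147, so the result is easy to verify.)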

Question:
When I use the NVIDIA Visual Profiler to check the register usage of this kernel, it turns out to be 52 registers per thread. I don't understand why, since most of the variables in this code are shared variables. Can you let me know how I can find out which parts of my code are using registers?

How can I reduce them? Is there any optimization I can do to this code?

Hardware

I am using a Fermi device, a GeForce GTX 470 (compute capability 2.0).

Thanks,


Solution

Register usage is not directly related to the number of variables you define: the compiler also uses registers to store the results of intermediate calculations for which no variable is ever declared.
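
As a trivial, hypothetical sketch of what that means (this function is not from the question): a single expression with no declared locals still consumes registers for its partial results, and in the kernel above the loop counters i and k are register-resident as well, even though the data arrays are shared.

__device__ double dot3(double ax, double ay, double az,
                       double bx, double by, double bz) {
    // No local variables are declared here, but the compiler still needs
    // registers to hold the partial products ax*bx, ay*by and az*bz
    // before it can add them up.
    return ax * bx + ay * by + az * bz;
}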

One way to spot the parts of the code that use the most registers is to hack the generated PTX file, manually annotating the code with a syntax like

asm volatile ("// code at this line is doing this and this ..."); 
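For example (a minimal sketch; the marker strings are arbitrary), bracketing the extrapolation loop of the kernel like this makes the comments appear verbatim in the generated PTX, so you can see exactly which instructions, and therefore which registers, that region compiles to:

asm volatile ("// BEGIN extrapolation loop");
double power = 1.0;
for (int k = 0; k < i; k++) {
    power *= 4.0;
    double sum  = (power * result[0] - extrapol[k]) / (power - 1.0);
    extrapol[k] = result[0];
    result[0]   = sum;
}
asm volatile ("// END extrapolation loop");

You can emit the PTX with nvcc -ptx (e.g. nvcc -arch=sm_20 -ptx integrate.cu; the file name is assumed) and then search the .ptx for the marker strings.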

Other tips

You can use the ptxas program to analyze your .ptx files; it shows the register and memory usage of each function. In your case you'd want to run ptxas --gpu-name sm_20 -v code.ptx.
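
In full, the workflow would look like this (the .cu file name is assumed):

nvcc -arch=sm_20 -ptx integrate.cu -o code.ptx    # emit PTX for compute capability 2.0
ptxas --gpu-name sm_20 -v code.ptx                # -v prints register and memory usage per entry function

Equivalently, compiling with nvcc -Xptxas -v prints the same per-kernel statistics as part of a normal build.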
