Question

Okay, so my kernel A reads two values from device memory, XORs them, and stores the result back to device memory.

However, another kernel B, which performs a lot of extra computation on those values before storing them back to device memory, is outperforming kernel A. I understand that if compute utilization is low, most warps will spend their time waiting on memory accesses rather than doing computation. But how can kernel B, which has the same number of memory accesses per thread plus extra computation, do everything faster?

Also, I tried adding a lot of synthetic arithmetic operations to kernel A, but the profiler still shows the same compute utilization. What is actually happening there?

__global__ void A(int *dr, int p, int q, int NORi)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int N = blockDim.x * gridDim.x;   // total thread count = stride between rows
    for (int i = 0; i < NORi; i++)
    {
        // Two coalesced loads, one XOR, one coalesced store per iteration.
        dr[(i + p) * N + tid] = dr[i * N + tid] ^ dr[(i + q) * N + tid];
    }
}

The memory accesses are coalesced. Kernel B is the same, but it performs much more arithmetic on the two operands instead of just an XOR before storing the result.
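
Kernel B itself is not shown in the question. For concreteness, here is a sketch of the general shape being described, assuming the same coalesced access pattern with extra integer arithmetic on the two operands; the specific operations below are placeholders, not the real kernel B:

__global__ void B(int *dr, int p, int q, int NORi)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int N = blockDim.x * gridDim.x;
    for (int i = 0; i < NORi; i++)
    {
        int a = dr[i * N + tid];
        int b = dr[(i + q) * N + tid];
        // Placeholder for the "lot of extra computations" described above;
        // the actual arithmetic in kernel B is not shown in the question.
        int r = (a * b) + (a >> 3) - (b << 2);
        dr[(i + p) * N + tid] = r;
    }
}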

No correct solution

OTHER TIPS

Is kernel B also performing the XOR? If not, the difference could come down to the throughput of the individual operations. Check the Throughput of Native Arithmetic Instructions table in the CUDA C Programming Guide; throughput varies considerably between operations and architectures. For instance, on Kepler, 32-bit integer multiply has only 1/5th the throughput of bitwise XOR (32 versus 160 operations per clock cycle per multiprocessor).
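
Regarding the unchanged compute utilization after adding synthetic arithmetic: one plausible explanation (an assumption, since the modified kernel is not shown) is that the compiler eliminated the extra operations as dead code because their results never reached memory. A minimal sketch that keeps the added work observable by folding it into the stored value; the EXTRA_OPS count and the multiply-add chain are illustrative, not from the original code:

// Variant of kernel A where the synthetic arithmetic feeds the store,
// so the compiler cannot eliminate it as dead code.
#define EXTRA_OPS 64   // arbitrary illustrative amount of extra work

__global__ void A_with_work(int *dr, int p, int q, int NORi)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int N = blockDim.x * gridDim.x;
    for (int i = 0; i < NORi; i++)
    {
        int r = dr[i * N + tid] ^ dr[(i + q) * N + tid];
        // Dependent chain of integer multiply-adds; because r is stored
        // afterwards, these instructions must actually execute.
        for (int k = 0; k < EXTRA_OPS; k++)
            r = r * 3 + k;
        dr[(i + p) * N + tid] = r;
    }
}

If a variant like this still shows the same utilization and runtime, the kernel really is memory-bound; if utilization rises instead, the earlier synthetic operations were likely being optimized away.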

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow