Question

My application takes 5200 ms to process a data set with OpenCL on the GPU, and 330 ms for the same data with OpenCL on the CPU, while the same processing done without OpenCL, using multiple CPU threads, takes 110 ms. The OpenCL timing covers only kernel execution, i.e. it starts just before clEnqueueNDRangeKernel and ends just after clFinish. A Windows gadget reports that I am using only 19% of the GPU. Even if I could push that to 100%, it would still take ~1000 ms, which is much slower than my CPU.


The work group size is a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, and I am using all compute units (6 on the GPU and 4 on the CPU). Here is my kernel:

__kernel void reduceURatios(__global myreal *coef, __global myreal *row, myreal ratio)
{
    size_t gid = get_global_id(0);

    myreal pCoef = coef[gid];
    myreal pRow = row[gid];

    pCoef = pCoef - (pRow * ratio);
    coef[gid] = pCoef;
}

I am getting similarly poor performance for another kernel:

__kernel void calcURatios(__global myreal *ratios, __global myreal *rhs, myreal c, myreal r)
{
    size_t gid = get_global_id(0);

    myreal pRatios = ratios[gid];
    myreal pRHS = rhs[gid];

    pRatios = pRatios / c;
    ratios[gid] = pRatios;

    //pRatios = pRatios * r;
    pRHS = pRHS - (pRatios * r);
    rhs[gid] = pRHS;
}

Questions:

  1. Why is my GPU performing so poorly compared to the CPU with OpenCL?
  2. Why is the CPU with OpenCL 3x slower than the multithreaded CPU code without OpenCL?

Solution

It would help if you added some information about how you enqueue this kernel; perhaps you are using an inappropriate local work size. (If in doubt, just pass NULL as the local work size, and OpenCL will choose an appropriate one.)
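For reference, the enqueue with an automatic local size looks like the fragment below. This is only a sketch: it assumes an already-created command queue `queue`, a built kernel `kernel`, and a problem size `n`, and it omits error handling.

```c
size_t globalSize = n;  /* total number of work items */

/* Passing NULL for the local work size lets the OpenCL runtime
   pick a local size that suits the device. */
cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    1,            /* work_dim */
                                    NULL,         /* global work offset */
                                    &globalSize,  /* global work size */
                                    NULL,         /* local size: let OpenCL choose */
                                    0, NULL, NULL);
```

If you do pick a local size yourself, it must evenly divide the global size and should be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, as you already noted.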

But even in the best case, it's unlikely that you will see a speedup here. The computation you are doing is heavily memory-bound. In the first kernel, each work item reads two elements from global memory, performs a trivial multiplication and subtraction, and then writes one element back to global memory (the second kernel is not much different). The bottleneck is simply not the computation, but the data transfer.

(BTW: Recently, I wrote a few general words about that in https://stackoverflow.com/a/22868938 ).

Maybe the new developments around Unified Memory, HSA, AMD Kaveri etc. will come to the rescue here, but these are still at an early stage.

EDIT: Maybe you could also describe the context in which you are performing these computations. If further kernels operate on the results of this one, perhaps they could be combined in order to improve the computation-to-memory ratio.
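As an illustration only: if your algorithm always runs calcURatios immediately followed by reduceURatios over the same index space (whether that holds depends on your data dependencies), fusing them into one kernel saves one kernel launch and one round trip through global memory per element:

```c
/* Hypothetical fusion of the two kernels shown in the question --
   valid only if reduceURatios never depends on a different 'ratio'
   computed between the two launches. */
__kernel void calcAndReduceURatios(__global myreal *ratios,
                                   __global myreal *rhs,
                                   __global myreal *coef,
                                   __global myreal *row,
                                   myreal c, myreal r, myreal ratio)
{
    size_t gid = get_global_id(0);

    myreal pRatios = ratios[gid] / c;
    ratios[gid] = pRatios;
    rhs[gid] -= pRatios * r;

    coef[gid] -= row[gid] * ratio;
}
```

The arithmetic is unchanged; the point is doing more work per byte fetched from global memory.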

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow