Question

I am using the following function to determine the best local (work-group) and global work sizes for my OpenCL application.

//maxWGSize == CL_KERNEL_WORK_GROUP_SIZE
//wgMultiple == CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
//compUnits == CL_DEVICE_MAX_COMPUTE_UNITS
//rems == max required work items

void MyOpenCL::getBestWGSize(cl_uint maxWGSize, cl_uint wgMultiple, cl_uint compUnits, cl_uint rems, size_t *gsize, size_t *lsize) const
{
    cl_uint cu = 1;
    if(wgMultiple <= rems)
    {
        bool flag = true;
        while(flag)
        {
            if(cu < compUnits)
            {
                cu++;
                if((wgMultiple * cu) > rems)
                {
                    cu--;
                    flag = false;
                    break;
                }
            }
            else if(wgMultiple < maxWGSize)
            {
                wgMultiple *= 2;
                if((wgMultiple * cu) > rems)
                {
                    wgMultiple /= 2;
                    flag = false;
                    break;
                }
            }
            else
            {
                cu++;
                if(((wgMultiple * cu) > rems) || (cu > 2 * compUnits))
                {
                    cu--;
                    flag = false;
                    break;
                }
            }
        }
    }
    else
    {
        bool flag = true;
        wgMultiple = 2;
        while(flag)
        {
            if(cu < compUnits)
            {
                cu++;
                if((wgMultiple * cu) > rems)
                {
                    cu--;
                    flag = false;
                    break;
                }
            }
            else
            {
                wgMultiple *= 2;
                if((wgMultiple * cu) > rems)
                {
                    wgMultiple /= 2;
                    flag = false;
                    break;
                }
                else
                {
                    cl_int temp = rems - (wgMultiple * cu);
                    if((temp == 0) || (temp == 1))
                    {
                       flag = false;
                       break;
                    }
                }
            }
        }
    }

    *gsize = wgMultiple * cu;
    *lsize = wgMultiple;
    if(rems < *gsize)
    {
        *gsize = rems;
        *lsize = rems;
    }
    if(cu != compUnits)
    {
        while((cu * 2) <= compUnits)
        {
            cu *= 2;
            if(*lsize % 2 == 0)
                *lsize /= 2;
        }
    }
}

The algorithm is:

  1. Decide how many work groups are required if the local size == CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
  2. If more work items are still required, multiply the local size by 2 until it reaches CL_KERNEL_WORK_GROUP_SIZE

Any suggestions for improving the algorithm?

Some results that I am getting:

(rems == max required work items)

Device   rems   maxWGSize   wgMultiple   compUnits   *gsize   *lsize
GPU        99         256           64           6       64       16
GPU        35         256            4           6       24        4
GPU        57         256            8           6       48        8
CPU        99        1024           16           4       64       16
CPU        35        1024            8           4       32        8
CPU        57        1024            8           4       32        8

Solution

Admittedly, I did not understand (and hardly tried to understand) what you are trying to compute there, because it looks overly complicated: Determining the best work-group size should hardly be related to the number of compute units, and it should not be necessary to compute it in such a complicated way.

As I said in the answer to the original question (and as confirmed by DarkZeros in his comment): As long as you don't use local memory etc., you can usually just pass NULL as the local work size, and OpenCL will choose it appropriately.
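For example, leaving the choice to the implementation is just a matter of passing NULL in the local-work-size slot of clEnqueueNDRangeKernel. A minimal sketch (commandQueue, kernel and the number of work items n are assumed to exist already):

size_t globalWorkSize = n;
cl_int status = clEnqueueNDRangeKernel(
    commandQueue,     // command queue
    kernel,           // kernel to launch
    1,                // one work dimension
    NULL,             // no global work offset
    &globalWorkSize,  // global work size
    NULL,             // local work size: chosen by the implementation
    0, NULL, NULL);   // no wait list, no event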

There may be some caveats, though. Depending on the global work size, the underlying OpenCL implementation may not be able to use a "good" local work size. For example, when the global work size is a prime number (larger than the maximum local work size), the implementation may be forced to use a local work size of 1...

This can usually be circumvented by padding the data to be a multiple of a more appropriate local work size. First of all, this means that you have to modify your kernel so that it does not process elements beyond the actual data size. In your kernel from the other question, you would have to add another parameter for the size and check it accordingly:

__kernel void reduceURatios(
    __global myreal *coef, 
    __global myreal *row, 
    myreal ratio,
    int sizeOfArrays)  // Add this parameter
{
    size_t gid = get_global_id(0);
    if (gid >= sizeOfArrays)
    {
        return; // Don't access invalid elements
    }

    myreal pCoef = coef[gid];
    myreal pRow = row[gid];

    pCoef = pCoef - (pRow * ratio);
    coef[gid] = pCoef;
}

Then you have more freedom for choosing the global work size. The code from the current question involved the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, but this should hardly be relevant for such a trivial kernel on a standard GPU. In contrast to that, the CL_DEVICE_MAX_WORK_GROUP_SIZE would be a good choice for the local work size (as long as there is no other limitation imposed by the kernel itself, e.g. by register pressure - but this is also definitely not the case here).
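For reference, these values can be queried on the host like this (a minimal sketch, assuming a device and kernel have already been created):

// Sketch: query the relevant work-group sizes.
size_t maxDeviceWGSize = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
    sizeof(maxDeviceWGSize), &maxDeviceWGSize, NULL);

size_t preferredMultiple = 0;
clGetKernelWorkGroupInfo(kernel, device,
    CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
    sizeof(preferredMultiple), &preferredMultiple, NULL);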

So you could just use the CL_DEVICE_MAX_WORK_GROUP_SIZE as the basis for the computation of your global work size:

// As queried with CL_DEVICE_MAX_WORK_GROUP_SIZE
int maxWorkGroupSize = ...

// n is the number of work items that are actually required;
// round it up to the next multiple of maxWorkGroupSize
int numWorkGroups = (n - 1) / maxWorkGroupSize + 1;
int globalSizePadded = numWorkGroups * maxWorkGroupSize;

Then invoke your kernel with this (padded) global work size. The if-statement that you added in the kernel makes sure that the extra work items do not access invalid memory regions. When you launch the kernel with this padded global size and set the local size to NULL, it should automatically choose CL_DEVICE_MAX_WORK_GROUP_SIZE as the local size (but of course, you could also specify it manually).
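Concretely, the launch could then look like the following sketch. Here commandQueue, kernel and the element count n are assumed to exist, globalSizePadded comes from the snippet above, the other kernel arguments are assumed to be set already, and argument index 3 corresponds to the sizeOfArrays parameter of the kernel shown above:

// Pass the actual data size so the kernel can skip the padding range
cl_int sizeOfArrays = n;
clSetKernelArg(kernel, 3, sizeof(cl_int), &sizeOfArrays);

// Launch with the padded global size and let OpenCL pick the local size
size_t globalWorkSize = globalSizePadded;
cl_int status = clEnqueueNDRangeKernel(
    commandQueue, kernel, 1, NULL,
    &globalWorkSize,
    NULL,            // local work size: left to the implementation
    0, NULL, NULL);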

This might make the computation from the original question faster, but it's still unlikely that it will be faster than the CPU version...

Licensed under: CC-BY-SA with attribution