Question

I am using the following function to determine the best local (work-group) and global work sizes for my OpenCL application.

//maxWGSize == CL_KERNEL_WORK_GROUP_SIZE
//wgMultiple == CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
//compUnits == CL_DEVICE_MAX_COMPUTE_UNITS
//rems == max required work items

void MyOpenCL::getBestWGSize(cl_uint maxWGSize, cl_uint wgMultiple, cl_uint compUnits, cl_uint rems, size_t *gsize, size_t *lsize) const
{
    cl_uint cu = 1;
    if(wgMultiple <= rems)
    {
        bool flag = true;
        while(flag)
        {
            if(cu < compUnits)
            {
                cu++;
                if((wgMultiple * cu) > rems)
                {
                    cu--;
                    flag = false;
                    break;
                }
            }
            else if(wgMultiple < maxWGSize)
            {
                wgMultiple *= 2;
                if((wgMultiple * cu) > rems)
                {
                    wgMultiple /= 2;
                    flag = false;
                    break;
                }
            }
            else
            {
                cu++;
                if(((wgMultiple * cu) > rems) || (cu > 2 * compUnits))
                {
                    cu--;
                    flag = false;
                    break;
                }
            }
        }
    }
    else
    {
        bool flag = true;
        wgMultiple = 2;
        while(flag)
        {
            if(cu < compUnits)
            {
                cu++;
                if((wgMultiple * cu) > rems)
                {
                    cu--;
                    flag = false;
                    break;
                }
            }
            else
            {
                wgMultiple *= 2;
                if((wgMultiple * cu) > rems)
                {
                    wgMultiple /= 2;
                    flag = false;
                    break;
                }
                else
                {
                    cl_int temp = rems - (wgMultiple * cu);
                    if((temp == 0) || (temp == 1))
                    {
                       flag = false;
                       break;
                    }
                }
            }
        }
    }

    *gsize = wgMultiple * cu;
    *lsize = wgMultiple;
    if(rems < *gsize)
    {
        *gsize = rems;
        *lsize = rems;
    }
    if(cu != compUnits)
    {
        while((cu * 2) <= compUnits)
        {
            cu *= 2;
            if(*lsize % 2 == 0)
                *lsize /= 2;
        }
    }
}

The algorithm is:

  1. Decide how many work groups are required if the local size == CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
  2. If more work items are still required, multiply the local size by 2 until it reaches CL_KERNEL_WORK_GROUP_SIZE

Any suggestions for improving the algorithm?

Some results that I am getting:

(rems == max required work items)

Device   rems   maxWGSize   wgMultiple   compUnits   *gsize   *lsize
GPU        99         256           64           6       64       16
GPU        35         256            4           6       24        4
GPU        57         256            8           6       48        8
CPU        99        1024           16           4       64       16
CPU        35        1024            8           4       32        8
CPU        57        1024            8           4       32        8

Solution

Admittedly, I did not understand (and hardly tried to understand) what you are trying to compute there, because it looks overly complicated: Determining the best work-group size should hardly be related to the number of compute units, and it should not be necessary to compute it in such a complicated way.

As I said in the answer to the original question (and as confirmed by DarkZeros in his comment): As long as you don't use local memory etc., you can usually just pass NULL as the local work size, and OpenCL will choose it appropriately.
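For example, leaving the choice to the implementation is just a matter of passing NULL in the local-work-size slot of clEnqueueNDRangeKernel. A minimal sketch (commandQueue, kernel and the number of work items n are assumed to exist already):

size_t globalWorkSize = n;
cl_int status = clEnqueueNDRangeKernel(
    commandQueue,     // command queue
    kernel,           // kernel to launch
    1,                // one work dimension
    NULL,             // no global work offset
    &globalWorkSize,  // global work size
    NULL,             // local work size: chosen by the implementation
    0, NULL, NULL);   // no wait list, no event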

There may be some caveats, though. Depending on the global work size, the underlying OpenCL implementation may not be able to use a "good" local work size. For example, when the global work size is a prime number (larger than the maximum local work size), the implementation may be forced to use a local work size of 1...

This can usually be circumvented by padding the data to be a multiple of a more appropriate local work size. First of all, this means that you have to modify your kernel so that it does not process elements beyond the actual data size. In your kernel from the other question, you would have to add another parameter for the size and check it accordingly:

__kernel void reduceURatios(
    __global myreal *coef, 
    __global myreal *row, 
    myreal ratio,
    int sizeOfArrays)  // Add this parameter
{
    size_t gid = get_global_id(0);
    if (gid >= sizeOfArrays)
    {
        return; // Don't access invalid elements
    }

    myreal pCoef = coef[gid];
    myreal pRow = row[gid];

    pCoef = pCoef - (pRow * ratio);
    coef[gid] = pCoef;
}

Then you have more freedom for choosing the global work size. The code from the current question involved the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, but this should hardly be relevant for such a trivial kernel on a standard GPU. In contrast to that, the CL_DEVICE_MAX_WORK_GROUP_SIZE would be a good choice for the local work size (as long as there is no other limitation imposed by the kernel itself, e.g. by register pressure - but this is also definitely not the case here).
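For reference, these values can be queried on the host like this (a minimal sketch, assuming a device and kernel have already been created):

// Sketch: query the relevant work-group sizes.
size_t maxDeviceWGSize = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
    sizeof(maxDeviceWGSize), &maxDeviceWGSize, NULL);

size_t preferredMultiple = 0;
clGetKernelWorkGroupInfo(kernel, device,
    CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
    sizeof(preferredMultiple), &preferredMultiple, NULL);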

So you could just use the CL_DEVICE_MAX_WORK_GROUP_SIZE as the basis for the computation of your global work size:

// As queried with CL_DEVICE_MAX_WORK_GROUP_SIZE
int maxWorkGroupSize = ...

// n is the number of work items that are actually required;
// round it up to the next multiple of maxWorkGroupSize
int numWorkGroups = (n - 1) / maxWorkGroupSize + 1;
int globalSizePadded = numWorkGroups * maxWorkGroupSize;

Then invoke your kernel with this (padded) global work size. The if-statement that you added in the kernel makes sure that the extra work items do not access invalid memory regions. When you launch the kernel with this padded global size and set the local size to NULL, it should automatically choose CL_DEVICE_MAX_WORK_GROUP_SIZE as the local size (but of course, you could also specify it manually).
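Concretely, the launch could then look like the following sketch. Here commandQueue, kernel and the element count n are assumed to exist, globalSizePadded comes from the snippet above, the other kernel arguments are assumed to be set already, and argument index 3 corresponds to the sizeOfArrays parameter of the kernel shown above:

// Pass the actual data size so the kernel can skip the padding range
cl_int sizeOfArrays = n;
clSetKernelArg(kernel, 3, sizeof(cl_int), &sizeOfArrays);

// Launch with the padded global size and let OpenCL pick the local size
size_t globalWorkSize = globalSizePadded;
cl_int status = clEnqueueNDRangeKernel(
    commandQueue, kernel, 1, NULL,
    &globalWorkSize,
    NULL,            // local work size: left to the implementation
    0, NULL, NULL);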

This might make the computation from the original question faster, but it's still unlikely that it will be faster than the CPU version...

Licensed under: CC-BY-SA with attribution