OpenCL reduction from private to local then global?

https://stackoverflow.com/questions/14962597

10-03-2022
|

Question

The following kernel computes an acoustic pressure field, with each thread computing it's own private instance of the pressure vector, which then needs to be summed down into global memory. I'm pretty sure the code which computes the pressurevector is correct, but I'm still having trouble making this produce the expected result.

int gid       = get_global_id(0);
int lid       = get_local_id(0);
int nGroups   = get_num_groups(0);
int groupSize = get_local_size(0);
int groupID   = get_group_id(0);

/* Each workitem gets private storage for the pressure field.
 * The private instances are then summed into local storage at the end.*/
private float2    pressure[HYD_DIM_TOTAL];
local   float2    pressure_local[HYD_DIM_TOTAL];

/* Code which computes value of 'pressure' */

//wait for all workgroups to finish accessing any memory
barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

/// sum all results in a workgroup into local buffer:
for(i=0; i<groupSize; i++){

    //each thread sums its own private instance into the local buffer
    if (i == lid){
        for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
            pressure_local[iHyd] += pressure[iHyd];
        }
    }
    //make sure all threads in workgroup get updated values of the local buffer
    barrier(CLK_LOCAL_MEM_FENCE);
}

/// copy all the results into global storage
//1st thread in each workgroup writes the group's local buffer to global memory
if(lid == 0){
    for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
        pressure_global[groupID +nGroups*iHyd] = pressure_local[iHyd];
    }
}

barrier(CLK_GLOBAL_MEM_FENCE);

/// sum the various instances in global memory into a single one
// 1st thread sums global instances
if(gid == 0){

    for(iGroup=1; iGroup<nGroups; iGroup++){

        //we only need to sum the results from the 1st group onward
        for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){

            pressure_global[iHyd] += pressure_global[iGroup*HYD_DIM_TOTAL +iHyd];
            barrier(CLK_GLOBAL_MEM_FENCE);
        }
    }
}

Some notes on data dimensions: The total number of threads will vary between 100 and 2000, but may on occasion lie outside this interval.
groupSizewill depend on hardware but I'm currently using values between 1(cpu) and 32(gpu).
HYD_DIM_TOTAL is known at compile time and varies between 4 and 32 (will generally, but not necessarily, be a power of 2).

Is there anything blatantly wrong with this reduction code?

PS: I run this on an i7 3930k with AMD APP SDK 2.8 and on an NVIDIA GTX580.

Solution

I notice two issues here, one big, one smaller:

This code suggests that you have a misunderstanding of what a barrier does. A barrier never synchronizes across multiple workgroups. It only synchronizes within a workgroup. The CLK_GLOBAL_MEM_FENCE makes it look like it is global synchronization, but it really isn't. That flag just fences all of the current work item's accesses to global memory. So outstanding writes will be globally observable after a barrier with this flag. But it does not change the barrier's synchronization behavior, which is only at the scope of a workgroup. There is no global synchronization in OpenCL, beyond launching another NDRange or Task.
The first for loop causes multiple work items to overwrite each others' computation. The indexing of pressure_local with iHyd will be done by each work item with the same iHyd. This will produce undefined results.

Hope this helps.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow