Question

I'm calling the kernel below with GlobalWorkSize 64 4 1 and WorkGroupSize 1 4 1 with the argument output initialized to zeros.

__kernel void kernelB(__global unsigned int * output) 
{
  uint gid0 = get_global_id(0);
  uint gid1 = get_global_id(1);

  output[gid0] += gid1;
}

I'm expecting 6 6 6 6 ... as the sum of the gid1's (0 + 1 + 2 + 3). Instead I get 3 3 3 3 ... Is there a way to get this functionality? In general I need the sum of the results of each work-item in a work group.

EDIT: It seems it must be said, I'd like to solve this problem without atomics.

Solution

You need to use local memory to store the output from all work items in the group. Once every work item has finished its computation, sum the results in a separate accumulation step.

__kernel void kernelB(__global unsigned int * output) 
{
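  //assumes the work group's items lie along dimension 0 (e.g. local size 4 1 1);
  //for the question's local size 1 4 1, use get_local_id(1) for item_id instead
  uint item_id = get_local_id(0);
  uint group_id = get_group_id(0);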

  //memory size is hard-coded to the expected work group size for this example
  local unsigned int result[4];

  //the computation
  result[item_id] = item_id % 3;

  //wait for all items to write to result
  barrier(CLK_LOCAL_MEM_FENCE);

  //simple O(n) reduction using the first work item in the group
  if(item_id == 0){
    for(int i=1;i<4;i++){
      result[0] += result[i];
    }
    output[group_id] = result[0];
  }
}
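
To check what the kernel should produce for the launch in the question, the two phases can be modeled on the host in plain, single-threaded C. This is only a sketch: `model_group_sum` and `GROUP_SIZE` are hypothetical names, the per-item value is assumed to be `gid1` as in the question, and the end of the first loop stands in for the barrier rather than reproducing real parallel execution.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4  /* assumed work group size, as in the question */

/* Hypothetical host-side model of the kernel: writes one sum per work group.
   output must have room for num_groups elements. */
void model_group_sum(unsigned int *output, size_t num_groups)
{
    assert(output != NULL);

    for (size_t group = 0; group < num_groups; group++) {
        unsigned int result[GROUP_SIZE];  /* stands in for local memory */

        /* Phase 1: every work item stores its own value (here gid1). */
        for (unsigned int item = 0; item < GROUP_SIZE; item++) {
            result[item] = item;
        }

        /* Finishing the loop above plays the role of the barrier:
           all values are written before anyone reads them. */

        /* Phase 2: one work item accumulates and writes the group's sum. */
        unsigned int sum = 0;
        for (unsigned int i = 0; i < GROUP_SIZE; i++) {
            sum += result[i];
        }
        output[group] = sum;
    }
}
```

Calling `model_group_sum(out, 64)` on a 64-element array fills every element with 0 + 1 + 2 + 3, which is the value the kernel output can be compared against.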

Other tips

Multiple work items access the same element of global memory simultaneously, so the result is undefined. You need to either use atomic operations or have each work item write to its own unique location.
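
The "unique location per work item" option can be sketched as a plain-C host-side model. The names `write_partials` and `sum_partials` and the `gid0 * size1 + gid1` slot layout are assumptions made for illustration, not part of the original answers. The first function iterates in reverse on purpose, to show that the result no longer depends on the order in which work items run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: each (gid0, gid1) pair owns exactly one slot,
   so no two work items ever write to the same element. */
void write_partials(unsigned int *partial, size_t size0, size_t size1)
{
    assert(partial != NULL);

    /* Reverse order on purpose: any execution order gives the same buffer. */
    for (size_t gid0 = size0; gid0-- > 0; ) {
        for (size_t gid1 = size1; gid1-- > 0; ) {
            partial[gid0 * size1 + gid1] = (unsigned int)gid1;
        }
    }
}

/* Second pass: add up the size1 slots that belong to each gid0. */
void sum_partials(const unsigned int *partial, unsigned int *output,
                  size_t size0, size_t size1)
{
    assert(partial != NULL && output != NULL);

    for (size_t gid0 = 0; gid0 < size0; gid0++) {
        unsigned int sum = 0;
        for (size_t gid1 = 0; gid1 < size1; gid1++) {
            sum += partial[gid0 * size1 + gid1];
        }
        output[gid0] = sum;
    }
}
```

On a device, the second pass could run on the host or in a follow-up kernel. The trade-off is a scratch buffer with one element per work item instead of one per group.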

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow