Can't make work Opencl Sum Reduction from AMD sample reduction when changing WorkGroups dimension

StackOverflow https://stackoverflow.com/questions/10145514

  •  31-05-2021
  •  | 
  •  

Question

The following code comes from the amd website

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {

  int global_index = get_global_id(0);
  float accumulator = INFINITY;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }

  // Perform parallel reduction
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for(int offset = get_local_size(0) / 2;
      offset > 0;
      offset = offset / 2) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
     result[get_group_id(0)] = scratch[0];
  }
}

I adapted it to make it work as a sum reduction:

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {

  int global_index = get_global_id(0);
  float accumulator = 0.0;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = accumulator + element;
    global_index += get_global_size(0);
  }

  // Perform parallel reduction
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for(int offset = get_local_size(0) / 2;
      offset > 0;
      offset = offset / 2) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = mine + other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
     result[get_group_id(0)] = scratch[0];
  }
}

And it works like a charm when I use one only work group (meaning that i give NULL as local_work_size to clEnqueueNDRangeKernel()), but things get out of my control when I try to change the workgroup dimension. (I should say I am a total newbie in OpenCl)

What I do is as follows

#define GLOBAL_DIM 600
#define WORK_DIM 60

size_t global_1D[3] = {GLOBAL_DIM,1,1};
size_t work_dim[3] = {WORK_DIM,1,1};
err = clEnqueueNDRangeKernel(commands, av_velocity_kernel, 1, NULL, global_1D, work_dim, 0, NULL, NULL); //TODO CHECK THIS LINE
if (err)    {
  printf("Error: Failed to execute av_velocity_kernel!\n");            printf("\n%s",err_code(err));   fflush(stdout);      return EXIT_FAILURE;    }

Do I do it the wrong way??

Furthermore, I noticed that if I set #define GLOBAL_DIM 60000 (which is what I would need) I run out of local memory. DO I get "more" local memory if I use several work groups, or the local memory is evenly spread between workgroups??

Was it helpful?

Solution

First of all, those reduction kernels only work correctly if the workgroup size is a power of two. This means that instead of 60 you should use something 64. Also, there is no way that changing the GLOBAL_DIM makes you run out of local memory: you're most probably doing something wrong when invoking the kernel.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top