Question

I am new to OpenCL and I am trying to compute the histogram of a grayscale image. I am performing this computation on an NVIDIA GT 330M GPU.

The code is:

__kernel void histogram(__global struct gray * input, __global int * global_hist, __local volatile int * histogram){
    int local_offset = get_local_id(0) * 256;             /* each work item gets its own 256-bin slice of local memory */
    int histogram_global_offset = get_global_id(0) * 256;
    int offset = get_global_id(0) * 1920;                 /* one image row per work item */
    int value;
    /* zero this work item's bins */
    for(unsigned int i = 0; i < 256; i++){
        histogram[local_offset + i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* accumulate the row */
    for(unsigned int i = 0; i < 1920; i++){
        value = input[offset + i].i;
        histogram[local_offset + value]++;
    }

    barrier(CLK_LOCAL_MEM_FENCE);
    /* copy this work item's bins out to global memory */
    for(unsigned int i = 0; i < 256; i++){
        global_hist[histogram_global_offset + i] = histogram[local_offset + i];
    }
}

This computation is performed on a 1920×1080 image.

I am launching the kernel with:

queue.enqueueNDRangeKernel(kernel_histogram, cl::NullRange, cl::NDRange(1080), cl::NDRange(1));

When the local size of the histogram is set to 256 * sizeof(cl_int), this computation takes 11,675 microseconds (measured with NVIDIA Nsight performance analysis).

Because the local work-group size was set to one, I tried increasing it to 8. But when I increase the local size of the histogram to 256 * 8 * sizeof(cl_int) while still computing with a local work-group size of 1, I get 85,177 microseconds.

So when I run it with 8 work items per work group, the speedup is measured against 85 ms rather than 11 ms, and the final time with 8 work items per work group is 13,714 microseconds.

But when I deliberately introduce a bug, setting local_offset to zero while keeping the local histogram at 256 * sizeof(cl_int) with 8 work items per work group, I get a much better time: 3,854 microseconds.

Does anybody have any ideas to speed up this computation?

Thanks!

Solution

This answer assumes you want to eventually reduce your histogram all the way down to 256 int values. You call the kernel with as many work groups as you have compute units on your device, and group size should be (as always) a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE on the device.
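
A minimal host-side sketch of that launch geometry, assuming a queue, device, and kernel_histogram set up as in the question (the variable names here are illustrative, and the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE query needs OpenCL 1.1):

cl_uint num_units = device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>();
size_t multiple =
    kernel_histogram.getWorkGroupInfo<CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE>(device);
size_t local_size  = multiple;                 // try larger multiples too, see below
size_t global_size = num_units * local_size;   // one work group per compute unit
queue.enqueueNDRangeKernel(kernel_histogram, cl::NullRange,
                           cl::NDRange(global_size), cl::NDRange(local_size));

With the launch set up that way, the kernel looks like this: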

__kernel void histogram(__global struct gray * input, __global int * global_hist){
  int group_id = get_group_id(0);
  int num_groups = get_num_groups(0);
  int local_id = get_local_id(0);
  int local_size = get_local_size(0);

  /* one shared 256-bin histogram per work group */
  volatile __local int histogram[256];

  int i;
  for(i = local_id; i < 256; i += local_size){
    histogram[i] = 0;
  }
  barrier(CLK_LOCAL_MEM_FENCE); /* all bins zeroed before anyone accumulates */

  int rowNum, colNum, value, global_hist_offset;

  for(rowNum = group_id; rowNum < 1080; rowNum += num_groups){
    for(colNum = local_id; colNum < 1920; colNum += local_size){
      value = input[rowNum*1920 + colNum].i;
      atomic_inc(&histogram[value]); /* local atomics replace the per-item copies */
    }
  }

  barrier(CLK_LOCAL_MEM_FENCE);
  /* write this group's partial histogram out to global memory */
  global_hist_offset = group_id * 256;
  for(i = local_id; i < 256; i += local_size){
    global_hist[global_hist_offset + i] = histogram[i];
  }
}

Each work group works cooperatively on one row of the image at a time. Then the group moves on to another row, computed using the num_groups value. This works well no matter how many groups you have. For example, if you have 7 compute units, group 3 (the fourth group) starts on row 3 of the image and then takes every 7th row thereafter, computing 154 rows in total and finishing with row 1074. Some work groups compute one more row -- groups 0 and 1 in this example, with 155 rows each.

The same interleaving is done within the work group when looking at the columns of the image: in the colNum loop, the Nth work item starts at column N and skips ahead by local_size columns. The remainder of this loop shouldn't come into play often, because CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will likely be a factor of 1920. Try all work group sizes from (1..X) * CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE up to the maximum work group size for your device.

One final point about this kernel: the results are not identical to those of your original kernel. Your global_hist array holds 1080 * 256 integers; the one here needs only num_groups * 256. This helps if you want a full reduction, because there is much less to add after the kernel executes.
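
If you do that final reduction on the host, it is a short loop over the per-group partial histograms. A sketch, assuming global_hist_buffer is the cl::Buffer behind global_hist and num_groups is the number of work groups you launched (num_units in the sketch above):

std::vector<cl_int> partial(num_groups * 256);
queue.enqueueReadBuffer(global_hist_buffer, CL_TRUE, 0,
                        partial.size() * sizeof(cl_int), partial.data());

std::vector<cl_int> final_hist(256, 0);   // the fully reduced 256-bin histogram
for (size_t g = 0; g < num_groups; g++)
    for (int i = 0; i < 256; i++)
        final_hist[i] += partial[g * 256 + i];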
