Question

I am new to OpenCL and I am trying to compute the histogram of a grayscale image. I am performing this computation on an NVIDIA GT 330M GPU.

The code is:

__kernel void histogram(__global struct gray * input, __global int * global_hist, __local volatile int * histogram){
    int local_offset = get_local_id(0) * 256;             /* each work item gets its own 256-bin slice of local memory */
    int histogram_global_offset = get_global_id(0) * 256;
    int offset = get_global_id(0) * 1920;                 /* one image row per work item */
    int value;
    /* zero this work item's bins */
    for(unsigned int i = 0; i < 256; i++){
        histogram[local_offset + i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* accumulate the row */
    for(unsigned int i = 0; i < 1920; i++){
        value = input[offset + i].i;
        histogram[local_offset + value]++;
    }

    barrier(CLK_LOCAL_MEM_FENCE);
    /* copy this work item's bins out to global memory */
    for(unsigned int i = 0; i < 256; i++){
        global_hist[histogram_global_offset + i] = histogram[local_offset + i];
    }
}

This computation is performed on a 1920×1080 image.

I am launching the kernel with:

queue.enqueueNDRangeKernel(kernel_histogram, cl::NullRange, cl::NDRange(1080), cl::NDRange(1));

When the local size of the histogram is set to 256 * sizeof(cl_int), this computation takes 11,675 microseconds (measured with NVIDIA Nsight performance analysis).

Because the local work-group size was set to one, I tried increasing it to 8. But when I increase the local size of the histogram to 256 * 8 * sizeof(cl_int) while still computing with a local work-group size of 1, I get 85,177 microseconds.

So when I run it with 8 work items per work group, the speedup is measured against 85 ms rather than 11 ms, and the final time with 8 work items per work group is 13,714 microseconds.

But when I deliberately introduce a bug, setting local_offset to zero while keeping the local histogram at 256 * sizeof(cl_int) with 8 work items per work group, I get a much better time: 3,854 microseconds.

Does anybody have any ideas to speed up this computation?

Thanks!

Solution

This answer assumes you want to eventually reduce your histogram all the way down to 256 int values. You call the kernel with as many work groups as you have compute units on your device, and group size should be (as always) a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE on the device.
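
A minimal host-side sketch of that launch geometry, assuming a queue, device, and kernel_histogram set up as in the question (the variable names here are illustrative, and the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE query needs OpenCL 1.1):

cl_uint num_units = device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>();
size_t multiple =
    kernel_histogram.getWorkGroupInfo<CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE>(device);
size_t local_size  = multiple;                 // try larger multiples too, see below
size_t global_size = num_units * local_size;   // one work group per compute unit
queue.enqueueNDRangeKernel(kernel_histogram, cl::NullRange,
                           cl::NDRange(global_size), cl::NDRange(local_size));

With the launch set up that way, the kernel looks like this: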

__kernel void histogram(__global struct gray * input, __global int * global_hist){
  int group_id = get_group_id(0);
  int num_groups = get_num_groups(0);
  int local_id = get_local_id(0);
  int local_size = get_local_size(0);

  /* one shared 256-bin histogram per work group */
  volatile __local int histogram[256];

  int i;
  for(i = local_id; i < 256; i += local_size){
    histogram[i] = 0;
  }
  barrier(CLK_LOCAL_MEM_FENCE); /* all bins zeroed before anyone accumulates */

  int rowNum, colNum, value, global_hist_offset;

  for(rowNum = group_id; rowNum < 1080; rowNum += num_groups){
    for(colNum = local_id; colNum < 1920; colNum += local_size){
      value = input[rowNum*1920 + colNum].i;
      atomic_inc(&histogram[value]); /* local atomics replace the per-item copies */
    }
  }

  barrier(CLK_LOCAL_MEM_FENCE);
  /* write this group's partial histogram out to global memory */
  global_hist_offset = group_id * 256;
  for(i = local_id; i < 256; i += local_size){
    global_hist[global_hist_offset + i] = histogram[i];
  }
}

Each work group works cooperatively on one row of the image at a time. Then the group moves on to another row, computed using the num_groups value. This works well no matter how many groups you have. For example, if you have 7 compute units, group 3 (the fourth group) starts on row 3 of the image and then takes every 7th row thereafter, computing 154 rows in total and finishing with row 1074. Some work groups compute one more row -- groups 0 and 1 in this example, with 155 rows each.

The same interleaving is done within the work group when looking at the columns of the image: in the colNum loop, the Nth work item starts at column N and skips ahead by local_size columns. The remainder of this loop shouldn't come into play often, because CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will likely be a factor of 1920. Try all work group sizes from (1..X) * CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE up to the maximum work group size for your device.

One final point about this kernel: the results are not identical to those of your original kernel. Your global_hist array holds 1080 * 256 integers; the one here needs only num_groups * 256. This helps if you want a full reduction, because there is much less to add after the kernel executes.
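
If you do that final reduction on the host, it is a short loop over the per-group partial histograms. A sketch, assuming global_hist_buffer is the cl::Buffer behind global_hist and num_groups is the number of work groups you launched (num_units in the sketch above):

std::vector<cl_int> partial(num_groups * 256);
queue.enqueueReadBuffer(global_hist_buffer, CL_TRUE, 0,
                        partial.size() * sizeof(cl_int), partial.data());

std::vector<cl_int> final_hist(256, 0);   // the fully reduced 256-bin histogram
for (size_t g = 0; g < num_groups; g++)
    for (int i = 0; i < 256; i++)
        final_hist[i] += partial[g * 256 + i];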
