Question

I'm just starting out learning OpenCL. I'm trying to get a feel for what performance gains to expect when moving functions/algorithms to the GPU.

The most basic kernel given in most tutorials is one that takes two arrays of numbers, adds the values at corresponding indexes, and writes the result to a third array, like so:

__kernel void 
add(__global float *a,
    __global float *b,
    __global float *answer)
{
    int gid = get_global_id(0);
    answer[gid] = a[gid] + b[gid];
}

__kernel void
sub(__global float* n,
    __global float* answer)
{
    int gid = get_global_id(0);
    answer[gid] = n[gid] - 2;
}

__kernel void
ranksort(__global const float *a,
         __global float *answer)
{
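  // rank sort: each work-item counts how many values are strictly smaller
  // than its own and writes its value at that index (assumes distinct values)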
  int gid = get_global_id(0);
  int gSize = get_global_size(0);
  int x = 0;
  for(int i = 0; i < gSize; i++){
    if(a[gid] > a[i]) x++;
  }
  answer[x] = a[gid];
}

I am assuming that you could never justify computing this on the GPU: the memory transfer would outweigh the time it would take to compute this on the CPU by orders of magnitude (I might be wrong about this, hence this question).
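
For scale, the CPU equivalent of the add kernel is just a single loop (this is my own sketch, not from any tutorial), which shows how little arithmetic there is per element that has to be transferred:

/* CPU equivalent of the add kernel: one addition per element,
   so the transfer of a, b and answer over the bus dominates easily */
void add_cpu(const float *a, const float *b, float *answer, int n)
{
    for (int i = 0; i < n; i++)
        answer[i] = a[i] + b[i];
}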

What I am wondering is: what would be the most trivial example where you would expect a significant speedup when using an OpenCL kernel instead of the CPU?

Solution

If you have a sufficiently large set of matrices you intend to perform linear algebra operations on, or if you're essentially performing the same operation on each element, I would regard that as a trivial example. Matrix multiplication, addition, FFTs, convolution, etc. will give you a bit of a speedup without much work. If you want to see the 100x speedups, then you need to delve into memory management and know a fair bit about what's going on behind the scenes.
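
To illustrate, a naive OpenCL matrix-multiplication kernel already does N multiply-adds per output element. This is only a minimal sketch (not from the answer above); the kernel name, the square N x N row-major layout, and the lack of any tiling or local-memory optimization are all assumptions for illustration:

/* Minimal sketch of a naive matrix multiply C = A * B for square
   N x N row-major matrices. Launched with a 2-D global size of N x N;
   each work-item computes one element of C with N multiply-adds. */
__kernel void
matmul(__global const float *A,
       __global const float *B,
       __global float *C,
       const int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int k = 0; k < N; k++)
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
}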

For getting started, I would recommend PyCUDA, since it provides a very high level of abstraction and will let you jump in very quickly. Check out this course on parallel computing using CUDA from the University of Illinois (http://courses.ece.illinois.edu/ece498/al/) when you are ready to dive in further.

OTHER TIPS

It depends on the definition of trivial. In my opinion it would be matrix-matrix product, since it has an O(N³) compute to O(N²) memory ratio. Algorithms which exhibit similar ratios are likely to benefit from being computed on the GPU.

While your kernel is clearly very trivial, it can still be a useful example: it is completely memory bound, since for every element you have two reads and one write but only one arithmetic operation. There are some instructions to compute the address etc., but all of this amounts to practically nothing compared with the cost of accessing memory.

Assuming the data is already on the GPU, you can benefit from the GPU's very high bandwidth to the memory even for this simple kernel.

Of course, GPUs rely on you having enough threads to hide the memory latency, so your local work-group size should be fairly large (say 256 or 512) and your global work size should be very large (e.g. hundreds of thousands of work-items) for this to be effective, but that's kind of the point!
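
As a rough host-side sketch (not part of the original answer; the function and variable names are made up, and it assumes the context, queue and kernel were created earlier), these sizes are what you pass when enqueuing the kernel:

/* Hedged sketch: enqueue a 1-D kernel with a large global work size and
   a work-group size of 256. n_elements is assumed to be a multiple of 256;
   error checking is omitted for brevity. */
#include <CL/cl.h>

void launch_add(cl_command_queue queue, cl_kernel kernel, size_t n_elements)
{
    size_t global_work_size = n_elements;   /* e.g. hundreds of thousands */
    size_t local_work_size  = 256;          /* enough work-items per group to hide latency */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_work_size, &local_work_size,
                           0, NULL, NULL);
    clFinish(queue);                        /* wait for the kernel to complete */
}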

I know the question is quite old but... I found that calculating the Mandelbrot set is quite well suited to the GPU. You have a complex input vector (float2) and a scalar output (int), and you'll have some hundred operations per input vector on average.

It could be used as a good example application (see the sketch after the list below), as it ...

  • has a 2-dimensional input dataset (it calculates an image)
  • lets you explain wavefronts and why 2-dimensional processing is beneficial in some cases
  • demonstrates vector data types
  • produces a picture that is quickly verifiable by human eyes (debugging)
  • can be easily extended by: color mapping (__constant), float4 processing instead of float2 (optimization), producing int4 (R,G,B,A) output vectors (optimization), reduction steps (RGBA) => (RGB)
  • the needed math knowledge is acceptable (simple formula)
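
A minimal kernel along these lines could look like the following. This is my own sketch, not Stefan's code; the kernel name, the one-work-item-per-pixel layout and the max_iter parameter are assumptions for illustration:

/* Mandelbrot sketch: one work-item per pixel. c[gid] is the point in the
   complex plane, and the output is the iteration count at which the orbit
   escapes the radius-2 circle (or max_iter if it never does). */
__kernel void
mandelbrot(__global const float2 *c,
           __global int *iterations,
           const int max_iter)
{
    int gid = get_global_id(0);
    float2 z = (float2)(0.0f, 0.0f);
    int i;
    for (i = 0; i < max_iter; i++) {
        /* z = z^2 + c, using complex arithmetic on float2 */
        z = (float2)(z.x * z.x - z.y * z.y, 2.0f * z.x * z.y) + c[gid];
        if (z.x * z.x + z.y * z.y > 4.0f)
            break;
    }
    iterations[gid] = i;
}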

Regards, Stefan

After matrix multiplication I would say image convolution (such as blur, denoising, etc.). Check out AMD's tutorial.
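
For reference, a simple 3x3 box-blur convolution kernel might look like the sketch below. It is not from AMD's tutorial; the kernel name, the single-channel float image layout and the clamped border handling are assumptions for illustration:

/* Sketch of a 3x3 box blur on a single-channel image of width x height.
   One work-item per output pixel (2-D global size); border pixels are clamped. */
__kernel void
box_blur(__global const float *in,
         __global float *out,
         const int width,
         const int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int ix = clamp(x + dx, 0, width - 1);
            int iy = clamp(y + dy, 0, height - 1);
            sum += in[iy * width + ix];
        }
    }
    out[y * width + x] = sum / 9.0f;
}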

What is "most trivial" is a matter of opinion, but I would say that computing an image of the Mandelbrot set is a pretty straightforward application using the GPU. Each point is totally independent of every other point, so you can start up a thread for each point and get tremendous speedup. The formula itself that is iterated is a simple quadratic function. I used it as an example in a tutorial that can be found on my blog here, just computing the numbers without even making an image to make it even simpler. Almost any embarrassingly parallel (see Wikipedia entry) problem is a good one to begin with.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow