Question

I have a for-loop in my kernel that I had hard-coded to iterate for a fixed number of loops of my code:

for (int kk = 0; kk < 50000; kk++)
{
  <... my code here ...>
}

I don't think the code in the loop is relevant to my question; it's just some simple table look-ups and integer math.

I wanted to make my kernel code a little more flexible, so I modified the loop to replace the hard-coded iteration count (50000) with a kernel input parameter, num_loops.

for (int kk = 0; kk < num_loops; kk++)
{
  <... more code here ...>
}

The thing I found is that even when my host program calls the kernel with

num_loops = 50000 

which is the same value as the previously hard-coded value, the performance of my kernel is cut almost in half.

I'm trying to figure out what is causing the performance degradation. I imagine it has something to do with the OpenCL compiler not being able to efficiently unroll the loop?

Is there a way to do what I'm trying to do without incurring the performance penalty?

UPDATE: Here are some results from playing with "#pragma unroll"

Unfortunately, it seems that unrolling the loops doesn't solve my performance issues.

Even unrolling the hard-coded loop degrades performance.

Here's the normal loop with the hard-coded value (best performance):

for (int kk = 0; kk < 50000; kk++)
// Time to execute = 0.18 (40180 Mi ops/sec)

If I unroll the loop, things get worse:

#pragma unroll
// or #pragma unroll 50000
for (int kk = 0; kk < 50000; kk++)
// Time to execute = 0.22 (33000 Mi ops/sec)

Here's the loop that uses a variable, num_loops = 50000:

for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.26 (27760 Mi ops/sec)

#pragma unroll 50000
for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.26 (27760 Mi ops/sec)

#pragma unroll
for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.24 (30280 Mi ops/sec)

Things do get a little better when using the num_loops variable with the plain "#pragma unroll"; however, even that is still about 25% slower than the hard-coded, non-unrolled version (the best case above).

Any other ideas on how to use num_loops as the loop variable without incurring a performance hit?


Solution

Yes, the most likely cause of the performance degradation is that the compiler can't unroll the loop when the trip count is only known at runtime. There are a few things you could try to improve the situation.


You could define the parameter as a preprocessor macro passed via your program build options. This is a common trick used to build values that are only known at runtime into kernels as compile-time constants. For example:

clBuildProgram(program, 1, &device, "-Dnum_loops=50000", NULL, NULL);

You could construct the build options dynamically using sprintf to make this more flexible. Clearly this will only be worth it if you don't need to change the parameter often, so that the overhead of recompilation doesn't become a problem.


You could investigate whether your OpenCL platform uses any pragmas that can give the compiler hints about loop-unrolling. For example, some OpenCL compilers recognise #pragma unroll (or similar). OpenCL 2.0 has an attribute for this: __attribute__((opencl_unroll_hint)).
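On an OpenCL 2.0 platform, the hint can be attached directly to the loop. This is a sketch; the kernel name, parameters, and the unroll factor of 8 are all placeholder choices to tune for your code:

```c
__kernel void my_kernel(__global const int *table,
                        __global int *out,
                        int num_loops)
{
    int acc = 0;
    /* OpenCL 2.0: ask the compiler to unroll this loop by a factor of 8. */
    __attribute__((opencl_unroll_hint(8)))
    for (int kk = 0; kk < num_loops; kk++)
    {
        /* <... table look-ups and integer math here ...> */
        acc += table[kk & 255];
    }
    out[get_global_id(0)] = acc;
}
```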


You could manually unroll the loop. How this would look depends on what assumptions you can make about the num_loops parameter. For example, if you know (or can ensure) that it will always be a multiple of 4, you could do something like this:

for (int kk = 0; kk < num_loops;)
{
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
}

Even if you can't make such assumptions, you should still be able to perform manual unrolling, but it may require some extra work (for example, to finish any remaining iterations).

OTHER TIPS

The for loop re-evaluates its condition (the second expression inside the (;;)) on every iteration to decide whether to continue. On a GPU, such data-dependent branching can cause control flow to diverge and discard unneeded computations, which is wasteful.

One way around this is to add another dimension to your NDRange and keep that dimension entirely within one work-group, so that it executes sequentially on a single compute unit.
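As a hedged sketch of that idea: enqueue a 2D NDRange where dimension 1 replaces the loop, and keep all of dimension 1 inside one work-group. Note that this only works if the chunk size fits within the device's maximum work-group size, so a 50000-iteration loop would need to be split into chunks; all names here are placeholders:

```c
/* Dimension 0: the original work-items.
   Dimension 1: a chunk of the former loop, kept within one work-group. */
size_t global_size[2] = { num_work_items, loop_chunk };  /* loop_chunk <= max work-group size */
size_t local_size[2]  = { 1, loop_chunk };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size,
                       0, NULL, NULL);
```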

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow