Yes, the most likely cause of the performance degradation is that the compiler can't unroll the loop, since its trip count is no longer known at compile time. There are a few things you could try to improve the situation.
You could define the parameter as a preprocessor macro passed via your program build options. This is a common trick used to build values that are only known at runtime into kernels as compile-time constants. For example:
clBuildProgram(program, 1, &device, "-Dnum_loops=50000", NULL, NULL);
You could construct the build options dynamically using sprintf to make this more flexible. Clearly this will only be worth it if you don't need to change the parameter often, so that the overhead of recompilation doesn't become a problem.
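As a rough sketch of the dynamic version, assuming a hypothetical helper named make_build_options and a caller-supplied buffer (neither is part of OpenCL), the options string could be built like this. It uses snprintf rather than sprintf so that a too-small buffer is reported instead of overflowed:

```c
#include <stdio.h>

/* Hypothetical helper: writes "-Dnum_loops=<value>" into buf.
   Returns 0 on success, -1 if formatting failed or the buffer is too small. */
int make_build_options(char *buf, size_t size, int num_loops)
{
    int n = snprintf(buf, size, "-Dnum_loops=%d", num_loops);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Usage on the host (program and device are assumed to exist already):
 *
 *     char options[64];
 *     if (make_build_options(options, sizeof options, 50000) == 0)
 *         clBuildProgram(program, 1, &device, options, NULL, NULL);
 */
```

Note that with this approach num_loops should no longer be declared as a kernel argument, because the preprocessor would substitute the number into the parameter list as well.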
You could investigate whether your OpenCL platform uses any pragmas that can give the compiler hints about loop-unrolling. For example, some OpenCL compilers recognise #pragma unroll (or similar). OpenCL 2.0 has an attribute for this: __attribute__((opencl_unroll_hint)).
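To show where such a hint goes, here is a sketch of kernel source embedded as a string in host code. The kernel name accumulate, its body, and the unroll factor of 4 are made up for illustration; the attribute has to appear immediately before the loop it applies to, and the compiler is free to ignore it:

```c
/* Hypothetical kernel source, stored as a string as host code often does.
   Building it requires an OpenCL C 2.0 compiler; on older platforms a
   vendor-specific "#pragma unroll" in the same position may work instead. */
const char *kernel_src =
    "__kernel void accumulate(__global float *out, int num_loops)\n"
    "{\n"
    "    float acc = 0.0f;\n"
    "    __attribute__((opencl_unroll_hint(4)))\n"
    "    for (int kk = 0; kk < num_loops; kk++)\n"
    "        acc += (float)kk;\n"
    "    out[get_global_id(0)] = acc;\n"
    "}\n";
```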
You could manually unroll the loop. How this would look depends on what assumptions you can make about the num_loops parameter. For example, if you know (or can ensure) that it will always be a multiple of 4, you could do something like this:
for (int kk = 0; kk < num_loops;)
{
    <... more code here ...>
    kk++;
    <... more code here ...>
    kk++;
    <... more code here ...>
    kk++;
    <... more code here ...>
    kk++;
}
Even if you can't make such assumptions, you should still be able to perform manual unrolling, but it may require some extra work (for example, to finish any remaining iterations).
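For the general case, one possible shape is a main loop unrolled by 4 followed by a short remainder loop. The sketch below is plain C with a trivial addition standing in for the real loop body; the function name sum_unrolled is made up for illustration:

```c
/* Sums 0 + 1 + ... + (num_loops - 1), unrolled by a factor of 4.
   The additions are placeholders for the real per-iteration work. */
long sum_unrolled(int num_loops)
{
    long acc = 0;
    int kk = 0;

    /* Largest multiple of 4 that does not exceed num_loops
       (stays at or below zero when num_loops is negative). */
    int main_end = num_loops - num_loops % 4;

    /* Main loop: four iterations per pass. */
    for (; kk < main_end; kk += 4) {
        acc += kk;
        acc += kk + 1;
        acc += kk + 2;
        acc += kk + 3;
    }

    /* Remainder loop: finishes the last 0 to 3 iterations. */
    for (; kk < num_loops; kk++)
        acc += kk;

    return acc;
}
```

The remainder loop costs at most three extra iterations, so it should not matter much for large values of num_loops.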