Question

I have a for-loop in my kernel that I had hard-coded to iterate for a fixed number of loops of my code:

for (int kk = 0; kk < 50000; kk++)
{
  <... my code here ...>
}

I don't think the code in the loop is relevant to my question; it's just some simple table look-ups and integer math.

I wanted to make my kernel code a little more flexible, so I modified the loop to replace the hard-coded iteration count (50000) with a kernel input parameter, num_loops.

for (int kk = 0; kk < num_loops; kk++)
{
  <... more code here ...>
}

The thing I found is that even when my host program calls the kernel with

num_loops = 50000 

which is the same value as the previously hard-coded value, the performance of my kernel is cut almost in half.

I'm trying to figure out what is causing the performance degradation. I imagine it has something to do with the OpenCL compiler not being able to efficiently unroll the loop?

Is there a way to do what I'm trying to do without incurring the performance penalty?

UPDATE: Here are some results from playing with "#pragma unroll"

Unfortunately, it seems that unrolling the loops doesn't solve my performance issues.

Even unrolling the hard-coded loop degrades performance.

Here's the normal loop with the hard-coded value (best performance):

for (int kk = 0; kk < 50000; kk++)
// Time to execute = 0.18 (40180 Mi ops/sec)

If I unroll the loop, things get worse:

#pragma unroll
// or #pragma unroll 50000
for (int kk = 0; kk < 50000; kk++)
// Time to execute = 0.22 (33000 Mi ops/sec)

Here's the loop that uses a variable, num_loops = 50000:

for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.26 (27760 Mi ops/sec)

#pragma unroll 50000
for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.26 (27760 Mi ops/sec)

#pragma unroll
for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.24 (30280 Mi ops/sec)

Things do get a little better when using the num_loops variable with the plain "#pragma unroll"; however, even that is still about 25% slower than the hard-coded, non-unrolled version (the best case above).

Any other ideas on how to use num_loops as the loop variable without incurring a performance hit?


Solution

Yes, the most likely cause of the performance degradation is that the compiler can't unroll the loop when the trip count is only known at runtime. There are a few things you could try to improve the situation.


You could define the parameter as a preprocessor macro passed via your program build options. This is a common trick used to build values that are only known at runtime into kernels as compile-time constants. For example:

clBuildProgram(program, 1, &device, "-Dnum_loops=50000", NULL, NULL);

You could construct the build options dynamically using sprintf to make this more flexible. Clearly this will only be worth it if you don't need to change the parameter often, so that the overhead of recompilation doesn't become a problem.


You could investigate whether your OpenCL platform uses any pragmas that can give the compiler hints about loop-unrolling. For example, some OpenCL compilers recognise #pragma unroll (or similar). OpenCL 2.0 has an attribute for this: __attribute__((opencl_unroll_hint)).
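On an OpenCL 2.0 platform, the hint can be attached directly to the loop. This is a sketch; the kernel name, parameters, and the unroll factor of 8 are all placeholder choices to tune for your code:

```c
__kernel void my_kernel(__global const int *table,
                        __global int *out,
                        int num_loops)
{
    int acc = 0;
    /* OpenCL 2.0: ask the compiler to unroll this loop by a factor of 8. */
    __attribute__((opencl_unroll_hint(8)))
    for (int kk = 0; kk < num_loops; kk++)
    {
        /* <... table look-ups and integer math here ...> */
        acc += table[kk & 255];
    }
    out[get_global_id(0)] = acc;
}
```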


You could manually unroll the loop. How this would look depends on what assumptions you can make about the num_loops parameter. For example, if you know (or can ensure) that it will always be a multiple of 4, you could do something like this:

for (int kk = 0; kk < num_loops;)
{
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
}

Even if you can't make such assumptions, you should still be able to perform manual unrolling, but it may require some extra work (for example, to finish any remaining iterations).

OTHER TIPS

The for loop re-evaluates its condition (the second expression inside the (;;)) on every iteration to decide whether to continue. On a GPU, such data-dependent branching can cause control flow to diverge and discard unneeded computations, which is wasteful.

One way around this is to add another dimension to your NDRange and keep that dimension entirely within one work-group, so that it executes sequentially on a single compute unit.
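As a hedged sketch of that idea: enqueue a 2D NDRange where dimension 1 replaces the loop, and keep all of dimension 1 inside one work-group. Note that this only works if the chunk size fits within the device's maximum work-group size, so a 50000-iteration loop would need to be split into chunks; all names here are placeholders:

```c
/* Dimension 0: the original work-items.
   Dimension 1: a chunk of the former loop, kept within one work-group. */
size_t global_size[2] = { num_work_items, loop_chunk };  /* loop_chunk <= max work-group size */
size_t local_size[2]  = { 1, loop_chunk };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size,
                       0, NULL, NULL);
```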

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow