It's not documented, but it should actually work with #pragma unroll. Can you check the compiler log to see whether the unroll is applied? I'm not sure whether the kernel analyzer uses the same compiler as the OpenCL runtime, so you might want to check that as well.
Otherwise, if you know that n comes in chunks of 256 (that is, n is always a multiple of 256), you can unroll manually by having one loop over blocks of 256 elements and another one inside with a fixed trip count of 256, which might be easier for the compiler to unroll. This removes the problem that the trip count of the inner loop is not known statically.
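To make the blocked approach concrete, here is a minimal sketch in plain C rather than OpenCL C, so it can be compiled and checked on the host. The names sum_simple, sum_blocked and BLOCK are made up for illustration, and the summation is only a placeholder for whatever your kernel actually computes. It assumes n is always a multiple of 256.

```c
#include <stddef.h>

#define BLOCK 256  /* assumed chunk size; n must be a multiple of it */

/* Reference version: the trip count n is only known at run time. */
float sum_simple(const float *data, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += data[i];
    return acc;
}

/* Blocked version: the outer loop walks over chunks of BLOCK elements,
 * the inner loop has a trip count that is known at compile time. */
float sum_blocked(const float *data, size_t n)
{
    float acc = 0.0f;
    for (size_t base = 0; base < n; base += BLOCK) {
        /* In an OpenCL kernel, this inner loop is where the
         * unroll pragma would go. */
        for (size_t j = 0; j < BLOCK; ++j)
            acc += data[base + j];
    }
    return acc;
}
```

Both functions visit the elements in the same order, so they should return the same value for any n that is a multiple of 256. If n is not a multiple of 256, sum_blocked reads past the end of the array, so you would need a separate remainder loop in that case.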
However, keep in mind that unrolling a loop is usually not that big of a win anyway, as you don't have many registers to cache your computation. The increased register pressure from the loop unrolling might lead to register spilling, which makes things even slower. You should check how fast the kernel actually is on the AMD card. With a newer NVIDIA OpenCL compiler, the unroll pragma might also no longer bring any benefit.