Question

I have separated out some functions that are different from one another but need to run in parallel in different work-items. Hence, when the kernel is called, it needs to decide which function has to be executed.

void call_calc0() {
    // code
}

void call_calc1() {
    // code
}

void call_calc2() {
    // code
}

void call_calc3() {
    // code 
}

__kernel void perform (__global double* A, __global double* B) {
    int idx = get_global_id(0);
    if (idx == 0) {
        call_calc0();
    } else if (idx == 1) {
        call_calc1();
    } else if (idx == 2) {
        call_calc2();
    } else if (idx == 3) {
        call_calc3();
    }
}

If there are 256/512 work-items, this code sample will not be the correct way to do it. How can I optimize this?

Solution

Your best optimization, if possible, would be to use four different kernels. You are calling this kernel with a group size of more than one, and that is where problems start to arise when it comes to parallel execution.

If it is at all possible, try to separate your global memory or use it in very careful, non-colliding ways. This should allow you to create four separate kernels, and get rid of the conditional code execution.
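For illustration only, here is a minimal sketch of what that split might look like, assuming the data for each calculation can be placed in its own buffers (the kernel names, arguments, and bodies below are hypothetical placeholders, not your actual code):

// Each calculation becomes its own kernel over its own data,
// so no work-item ever has to branch on its global id.
__kernel void perform_calc0(__global double* A0, __global double* B0) {
    int idx = get_global_id(0);
    B0[idx] = A0[idx] * 2.0;   // placeholder for the real calc0 work
}

__kernel void perform_calc1(__global double* A1, __global double* B1) {
    int idx = get_global_id(0);
    B1[idx] = A1[idx] + 1.0;   // placeholder for the real calc1 work
}

// ... and likewise for calc2 and calc3

On the host side, each kernel is then enqueued with its own NDRange (and, on devices that support it, its own command queue), so the four calculations can still run in parallel without any divergent branching inside a work-group.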

When the first if/case is encountered, some of the work-items in the group will run the code, but the other 75% of your work-items will wait. Most OpenCL devices, especially GPUs, operate this way. When those first 25% of work-items are done, they will wait while the next if/case code is executed.

This applies to all branching in OpenCL, e.g. if/else, switch, for, and while/do. Whenever some of your work-items in a group don't satisfy the condition, they wait for the others that do satisfy it. Then the 'else' group of work-items executes while the 'if' group waits.
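As a generic sketch of that effect (not your call_calcX() code, just a hedged illustration): the two kernels below compute the same result, but in the first one half of each work-group idles while the other half runs, whereas the second keeps every work-item on the same instruction stream by computing both values and letting select() pick one.

// Divergent version: work-items that fail the test wait for the ones that pass it.
__kernel void divergent(__global double* A, __global double* B) {
    int idx = get_global_id(0);
    if (idx % 2 == 0)
        B[idx] = A[idx] * 2.0;
    else
        B[idx] = A[idx] + 1.0;
}

// Branch-free version: every work-item executes the same instructions,
// and select() chooses the result per work-item.
__kernel void uniform_flow(__global double* A, __global double* B) {
    int idx = get_global_id(0);
    double doubled = A[idx] * 2.0;
    double shifted = A[idx] + 1.0;
    B[idx] = select(shifted, doubled, (long)(idx % 2 == 0));   // non-zero picks 'doubled'
}

This only pays off when both sides of the branch are cheap; with large bodies like your call_calcX() functions, separate kernels remain the better option.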

Another way to look at it is to compare CPU and GPU hardware. CPUs have a lot of transistors dedicated to branch prediction and cache memory. GPUs are much more vector-based in nature, and have only recently begun to support some of the more advanced flow-control features of CPUs.

Other tips

Since OpenCL doesn't support function pointers, you are restricted to either if/else or switch. The performance of these two should be identical; it's just a difference in coding preference.

You could make things a little easier/cleaner with preprocessor macros. For example, you could do something like this:

#define CALL_CASE(i)    \
    case i:             \
        call_calc##i(); \
        break;

__kernel void perform (__global double* A, __global double* B) {
    int idx = get_global_id(0);
    switch (idx) {
        CALL_CASE(0);
        CALL_CASE(1);
        CALL_CASE(2);
        CALL_CASE(3);
        ... // etc
    }
}

If you are generating your call_calcX() functions automatically, it would be easy enough to also generate this switch block at the same time. If you are manually writing these call_calcX() functions, then it's only one extra line of code to add each function to the block. Not ideal, but not terrible either.
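For example, if the kernel source is assembled on the host, the switch block can be emitted in the same pass. A hedged C sketch, assuming the generated call_calcX() definitions are appended to `source` elsewhere and the final string is handed to clCreateProgramWithSource():

#include <stdio.h>
#include <string.h>

#define N_FUNCS 4   /* number of generated call_calcX() functions */

/* Append the perform() kernel, with one case per generated function,
 * to an already partially built OpenCL source string. */
static void append_switch_block(char *source, size_t cap) {
    strncat(source,
            "__kernel void perform(__global double* A, __global double* B) {\n"
            "    int idx = get_global_id(0);\n"
            "    switch (idx) {\n",
            cap - strlen(source) - 1);
    for (int i = 0; i < N_FUNCS; ++i) {
        char line[64];
        snprintf(line, sizeof line,
                 "        case %d: call_calc%d(); break;\n", i, i);
        strncat(source, line, cap - strlen(source) - 1);
    }
    strncat(source, "    }\n}\n", cap - strlen(source) - 1);
}

int main(void) {
    char source[4096] = "";   /* the generated call_calcX() bodies would go here */
    append_switch_block(source, sizeof source);
    puts(source);             /* in real code, pass 'source' to clCreateProgramWithSource() */
    return 0;
}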

As per the comments above, this problem doesn't appear to be at all data-parallel, which will limit the ability to take advantage of the SIMD execution available in most OpenCL devices.

Licensed under: CC-BY-SA with attribution