Your best optimization, if possible, would be to use four different kernels. Because you are calling this kernel with a group size greater than one, problems arise when the work items execute in parallel.
If at all possible, try to partition your global memory, or access it in careful, non-colliding ways. This should let you create four separate kernels and get rid of the conditional code execution entirely.
When the first if/case is encountered, 25% of the group's work items will run its code while the other 75% wait. Most OpenCL devices, especially GPUs, operate this way. When that first 25% of work items is done, they wait in turn while the next if/case is executed.
This applies to all branching in OpenCL, e.g. if/else, switch, for, and while/do. Whenever some of the work items in a group don't satisfy a condition, they wait for the ones that do. Then the 'else' group of work items executes while the 'if' group waits.
Another way to look at it is to compare CPU and GPU hardware. CPUs dedicate a lot of transistors to branch prediction and cache memory. GPUs are much more vector-based in nature, and are only recently beginning to support some of the more advanced flow-control features of CPUs.