Maybe you could add some information about how you enqueue this kernel - maybe with an inappropriate local work size? (In doubt, just pass null
as the local work size - OpenCL will choose an appropriate one).
But even in the best case, it's unlikely that you will see a speedup here. The computation that you are doing there is heavily memory-bound. In the first kernel, you are reading two elements from global memory, then performing a trivial subtraction/multiplication, and afterwards writing an element to global memory (and in the second kernel, it's not much different). The bottleneck here is simply not the computation, but the data transfer.
(BTW: Recently, I wrote a few general words about that in https://stackoverflow.com/a/22868938 ).
Maybe the new developments of Unified Memory, HSA, AMD Kaveri etc. will come for the rescue here, but this is still in an early stage.
EDIT: Maybe you could also describe in which context you are performing these computations. If you have further computations (kernels) that work on the results of this kernel, maybe they could be combined on order to improve the memory/computation ratio.