Your kernel cannot be vectorized partly because of the termination conditions in the `for` loops. These conditions all rely on variables calculated from your kernel's inputs, so at kernel compile time the Intel OpenCL C compiler has no idea how many iterations those loops will perform and hence cannot vectorize them. If you change the inner loop from `for (size_t i = 0; i < endX; i++)` to `for (size_t i = 0; i < 4; i++)` then the kernel gets vectorized. Of course this change doesn't compute what you want, but at least your kernel gets vectorized :) .
I think the strategy you want to try is to vectorize along the X-dimension of your grid of threads. This means that you would launch half the number of threads along X and use the `vload2` and `vstore2` functions to read from and write to global memory. You could go with 4-, 8- or 16-element vectors as well, in which case you would launch 1/4, 1/8 or 1/16 of your current number of threads along the X-dimension respectively.
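As a rough sketch of what this looks like (the kernel name, buffers and the doubling computation are made up for illustration), each work-item handles two adjacent X elements via `vload2`/`vstore2`, and the host launches the kernel with half the global size along X:

    // Hypothetical kernel: launch with global size (width/2, height)
    __kernel void process(__global const float *in,
                          __global float *out,
                          const int width)
    {
        size_t x = get_global_id(0);           // index in units of float2
        size_t y = get_global_id(1);
        size_t idx = y * (width / 2) + x;

        // vload2 reads the two consecutive floats at elements 2*idx, 2*idx+1
        float2 v = vload2(idx, in);

        v = v * 2.0f;                          // placeholder computation, lane-wise

        // vstore2 writes both results back to the same positions
        vstore2(v, idx, out);
    }

Note that `vload2(idx, p)` addresses in units of whole `float2` elements, i.e. it reads starting at `p[2 * idx]`, so the global size along X shrinks by the same factor as the vector width.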
Since you are using a second-generation Core i7 and `float` data, you will probably want to use `float8`, `vload8` and `vstore8`, since your CPU supports AVX instructions that operate on 8 floating-point values simultaneously. Note that this will not be performance portable: for example, some GPUs work well up to `float2` but performance drops off with `float4`/`float8`/`float16`. Older CPUs under the AMD CPU runtime don't have access to AVX instructions, only SSE, which uses 4-element floating-point vectors. Therefore, you should make the vector size a tunable parameter via a macro passed in the options to `clBuildProgram`, using a string like `"-D vectype=float4"` for example.
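One simple way to make the width tunable (a sketch; the macro names and kernel are illustrative, not from your code) is to pass the vector type and the matching `vloadN`/`vstoreN` names together as build options, e.g. `"-D vectype=float8 -D vloadn=vload8 -D vstoren=vstore8"`, with a fallback default in the kernel source:

    // Defaults if no -D options are given at build time
    #ifndef vectype
    #define vectype float4
    #define vloadn  vload4
    #define vstoren vstore4
    #endif

    __kernel void scale(__global const float *in,
                        __global float *out)
    {
        size_t i = get_global_id(0);       // index in units of vectype

        vectype v = vloadn(i, in);         // expands to vload4/vload8/...
        v = v * 0.5f;                      // placeholder computation
        vstoren(v, i, out);                // matching vector store
    }

Then benchmarking the same kernel rebuilt with `float2`, `float4` and `float8` on each target device tells you which width to ship, with no kernel source changes required.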