OpenCL on Xeon Phi: 2D Convolution Experience - OpenCL vs OpenMP

Question 1

Intel's OpenCL implementation uses what it calls "implicit vectorisation" to take advantage of the vector floating-point units. This involves mapping work-items onto SIMD lanes. In your example, each work-item processes a single pixel, which means that each hardware thread processes 16 pixels at a time using the Xeon Phi's 512-bit vector units (512 bits / 32 bits per single-precision value = 16 lanes).
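As a rough mental model only (this is not Intel's generated code), implicit vectorisation can be pictured as turning the per-pixel kernel into a loop over work-items that the compiler then runs 16 lanes at a time. The names kernelBody, runWorkItems and kLanes below are hypothetical, the kernel body is a placeholder, and single-precision data is assumed:

```cpp
#include <cassert>

// 512-bit vector register / 32-bit float = 16 lanes (assumes single precision).
constexpr int kVectorBits = 512;
constexpr int kLanes = kVectorBits / (8 * static_cast<int>(sizeof(float)));
static_assert(kLanes == 16, "expected 16 float lanes per 512-bit vector");

// Hypothetical stand-in for the body of a per-pixel OpenCL kernel:
// one call corresponds to one work-item, and gid plays the role of
// get_global_id(0).
inline float kernelBody(const float* in, int gid)
{
    return 2.0f * in[gid];
}

// Conceptual model of implicit vectorisation: consecutive work-items become
// consecutive iterations of a loop, and that loop is what gets vectorised,
// so up to kLanes work-items can share one vector instruction.
void runWorkItems(const float* in, float* out, int globalSize)
{
    assert(in != nullptr && out != nullptr);

    #pragma omp simd
    for (int gid = 0; gid < globalSize; ++gid)
        out[gid] = kernelBody(in, gid);
}
```

The point of the sketch is that the vector lanes run across neighbouring pixels, not across the arithmetic inside a single pixel.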

By contrast, your OpenMP code is parallelising across pixels, and then vectorising the computation within a pixel. This is almost certainly where the performance difference is coming from.

To get ICC to vectorise your OpenMP code in a way that resembles the implicitly vectorised OpenCL code, remove the #pragma ivdep and #pragma vector aligned statements from the innermost loop, and instead place a single #pragma simd in front of the horizontal pixel loop:

#pragma omp parallel for num_threads(nNumThreads)
for (int yOut = 0; yOut < nHeight; yOut++)
{
    const int yInTopLeft = yOut;

    #pragma simd
    for (int xOut = 0; xOut < nWidth; xOut++)
    {

When I compile this with ICC, it reports that it is successfully vectorising the desired loop.
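For reference, here is a self-contained sketch of what the whole loop nest might look like with the vectorisation moved to the horizontal pixel loop. It is an assumption-laden illustration, not your original code: it uses the standard OpenMP 4.0 `#pragma omp simd` in place of ICC's `#pragma simd`, and the function name convolveRowsSimd, the buffer layout (input padded by nFilterWidth - 1 in each dimension), and the unflipped filter indexing are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: threads are distributed over rows, SIMD lanes over the pixels of a
// row. Assumed (hypothetical) layout:
//   input  : (nWidth + nFilterWidth - 1) x (nHeight + nFilterWidth - 1)
//   filter : nFilterWidth x nFilterWidth, applied without flipping
//   output : nWidth x nHeight
void convolveRowsSimd(const std::vector<float>& input,
                      const std::vector<float>& filter,
                      std::vector<float>& output,
                      int nWidth, int nHeight, int nFilterWidth,
                      int nNumThreads)
{
    const int nInWidth  = nWidth  + nFilterWidth - 1;
    const int nInHeight = nHeight + nFilterWidth - 1;
    assert(input.size()  == static_cast<std::size_t>(nInWidth) * nInHeight);
    assert(filter.size() == static_cast<std::size_t>(nFilterWidth) * nFilterWidth);
    assert(output.size() == static_cast<std::size_t>(nWidth) * nHeight);

    const float* pIn     = input.data();
    const float* pFilter = filter.data();
    float*       pOut    = output.data();

    #pragma omp parallel for num_threads(nNumThreads)
    for (int yOut = 0; yOut < nHeight; yOut++)
    {
        const int yInTopLeft = yOut;

        // Vectorise across neighbouring output pixels, not across filter taps.
        #pragma omp simd
        for (int xOut = 0; xOut < nWidth; xOut++)
        {
            const int xInTopLeft = xOut;
            float sum = 0.0f;
            for (int r = 0; r < nFilterWidth; r++)
                for (int c = 0; c < nFilterWidth; c++)
                    sum += pFilter[r * nFilterWidth + c] *
                           pIn[(yInTopLeft + r) * nInWidth + xInTopLeft + c];
            pOut[yOut * nWidth + xOut] = sum;
        }
    }
}
```

With the filter loops left scalar, each SIMD lane accumulates the sum for its own output pixel, which mirrors the one-work-item-per-pixel mapping of the OpenCL kernel.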

Question 2

Previously (with #pragma ivdep and #pragma vector aligned on the innermost loop):

Compiler output: 
Convolve.cpp(24): (col. 17) remark: LOOP WAS VECTORIZED

Program output:
120 Cores: 0.0087 ms

After the advice from @jprice (with #pragma simd on the horizontal pixel loop):

Compiler output:
Convolve.cpp(24): (col. 9) remark: SIMD LOOP WAS VECTORIZED

Program output:
120 Cores: 0.00305

The OpenMP version is now about 2.8X faster than in its previous run. A fair comparison can now be made with OpenCL! Thanks to jprice and to everyone who contributed. I learnt a great deal from you all.

EDIT: Here are my results and comparison:

            image   filter  exec Time (ms)
OpenMP  2048x2048   3x3     4.3
OpenCL  2048x2048   3x3     1.04

Speedup: 4.1X

Can OpenCL really be this much faster than OpenMP?

Question 3

Your OpenMP program uses one thread per image row, and the pixels within a row are vectorised. This is equivalent to a one-dimensional work-group in OpenCL, where each work-group processes one row of the image. In your OpenCL code, however, you appear to have a two-dimensional work-group, so each work-group (mapped onto one thread on the Phi) processes a BLOCK of the image rather than a ROW. The cache hit rates will therefore differ.
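To make the difference concrete, here is a small sketch that labels every pixel with the unit of work (thread or work-group) that would own it under each decomposition. The function names and the tile dimensions are hypothetical; the actual work-group size used by your OpenCL code is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row decomposition: one unit of work per image row, as in the OpenMP code.
std::vector<int> assignByRow(int width, int height)
{
    assert(width > 0 && height > 0);
    std::vector<int> owner(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            owner[y * width + x] = y;
    return owner;
}

// Block decomposition: one unit of work per tileW x tileH tile, as with a
// two-dimensional OpenCL work-group.
std::vector<int> assignByBlock(int width, int height, int tileW, int tileH)
{
    assert(width > 0 && height > 0 && tileW > 0 && tileH > 0);
    const int tilesPerRow = (width + tileW - 1) / tileW;  // round up
    std::vector<int> owner(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            owner[y * width + x] = (y / tileH) * tilesPerRow + (x / tileW);
    return owner;
}
```

In the row scheme a unit of work walks one long contiguous span of memory; in the block scheme it touches several shorter spans from different rows, and with a filter of height greater than 1 the vertically adjacent input rows it needs are partly shared within the same tile. Which pattern suits the cache better depends on the image width, tile size, and filter size, so it is worth measuring both.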