Question

I want to write a kernel that adds two 3-dimensional matrices within a limited region. My code looks like this:

#define PREC float

typedef struct _clParameter clParameter;
struct _clParameter {
    size_t width;
    size_t minWidth;
    size_t maxWidth;
    size_t height;
    size_t minHeight;
    size_t maxHeight;
    size_t depth;
    size_t minDepth;
    size_t maxDepth;
};

__kernel void clMatrixBasicOperate1Add(
    __global const PREC * restrict in1,
    __global const PREC * restrict in2,
    __global PREC * restrict out,
    __private const clParameter par) {

    size_t sizeOfXY = par.width * par.height;

    // 3-Dimension matrix

    size_t X = get_global_size(0);
    size_t x = get_global_id(0);

    size_t Y = get_global_size(1);
    size_t y = get_global_id(1);

    size_t Z = get_global_size(2);
    size_t z = get_global_id(2);

    size_t endX = (par.maxWidth - par.minWidth + 1)     / X;
    size_t endY = (par.maxHeight - par.minHeight + 1)   / Y;
    size_t endZ = (par.maxDepth - par.minDepth + 1)     / Z;

    if(x<( (par.maxWidth    - par.minWidth  + 1) % X) )     endX += 1;
    if(y<( (par.maxHeight   - par.minHeight + 1) % Y) )     endY += 1;
    if(z<( (par.maxDepth    - par.minDepth  + 1) % Z) )     endZ += 1;

    for(size_t k=0;k<endZ;k++)
    for(size_t j=0;j<endY;j++)
    for(size_t i=0;i<endX;i++) {
        size_t index = (par.minDepth + k*Z + z) * sizeOfXY + (par.minHeight + j*Y + y) * par.width + (par.minWidth + i*X + x);
        out[index] = in1[index] + in2[index];
    }

    // return
}

When I build it with Intel Kernel Builder for OpenCL API, it reports:

Setting target instruction set architecture to: Default (Advanced Vector Extension (AVX))
OpenCL Intel CPU device was found!
Device name: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
Device version: OpenCL 1.2 (Build 83073)
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <clMatrixBasicOperate1Add> was not vectorized
Done.
Build succeeded!

I want to know why clMatrixBasicOperate1Add was not vectorized.


Solution

Your kernel cannot be vectorized partly because of the termination conditions in the for loops. These conditions all rely on variables calculated from your kernel's inputs. Therefore, at kernel compile time, the Intel OpenCL C compiler has no idea how many iterations those loops will perform and hence cannot vectorize them. If you change the inner loop from for(size_t i=0;i<endX;i++) to for(size_t i=0;i<4;i++) then the kernel gets vectorized. Of course this change doesn't do what you want, but at least your kernel gets vectorized :) .

I think the strategy you want to try is to vectorize along the X-dimension of your grid of threads. This means that you would launch 1/2 the number of threads along X but instead use the vload2 and vstore2 functions to read from and write to global memory. You could go with 4, 8 or 16 element vectors as well, in which case you would launch 1/4, 1/8 or 1/16 of your current number of threads along the X-dimension respectively.
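The idea above can be sketched as follows. This is only an illustration under stated assumptions, not a drop-in replacement: the kernel name clMatrixAddVec2 and the plain uint parameters (used here instead of the clParameter struct) are hypothetical, the region width is assumed to be a multiple of 2, and the NDRange is assumed to be (regionWidth / 2, regionHeight, regionDepth) so that each work-item handles exactly one float2 and no loops are needed.

```c
// Hypothetical sketch: each work-item adds 2 consecutive floats along X.
// Assumptions (not from the original code):
//   - region width (maxWidth - minWidth + 1) is a multiple of 2
//   - global size is (regionWidth / 2, regionHeight, regionDepth)
//   - scalar uint parameters replace the clParameter struct
__kernel void clMatrixAddVec2(
    __global const float * restrict in1,
    __global const float * restrict in2,
    __global float * restrict out,
    const uint width,      // full matrix width
    const uint height,     // full matrix height
    const uint minWidth,   // first x of the region
    const uint minHeight,  // first y of the region
    const uint minDepth)   // first z of the region
{
    const size_t x = get_global_id(0);
    const size_t y = get_global_id(1);
    const size_t z = get_global_id(2);

    // Scalar index of the first of the 2 elements this work-item handles.
    const size_t index = (minDepth + z) * (size_t)width * height
                       + (minHeight + y) * (size_t)width
                       + minWidth + 2 * x;

    // Offset 0 plus a shifted base pointer: vload2/vstore2 only require
    // alignment to the element type, so an odd minWidth is still valid.
    const float2 a = vload2(0, in1 + index);
    const float2 b = vload2(0, in2 + index);
    vstore2(a + b, 0, out + index);
}
```

For wider vectors the same pattern would apply with float4/8/16, the matching vloadN/vstoreN, and the factor 2 in the index replaced by the vector width. If the region width is not a multiple of the vector width, the leftover elements need a separate scalar path.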

Since you are using a second generation Core i7 and float data, you will probably want to use float8, vload8 and vstore8, since your CPU supports AVX instructions that operate on 8 floating point values simultaneously. Note that this will not be performance portable, e.g. some GPUs work well up to float2 but performance drops off when using float4/8/16. Older CPUs using the AMD CPU runtime don't have access to AVX instructions, only SSE, which uses 4-element floating point vectors. Therefore, you should make the vector size a tunable parameter via a macro passed in the options for clBuildProgram, using a string like "-D vectype=float4" for example.
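On the host side, making the vector width tunable could look roughly like the sketch below. The helper names make_build_options and global_size_x, and the extra VECWIDTH macro, are hypothetical; only the "-D vectype=floatN" option string comes from the answer. The resulting string would be passed as the options argument of clBuildProgram, and the X global size would shrink by the vector width.

```c
#include <stdio.h>
#include <stddef.h>

// Hypothetical helper: formats the options string for clBuildProgram so the
// kernel source can refer to vectype (e.g. float8) and VECWIDTH (e.g. 8).
// Returns 0 on success, -1 if the width is unsupported or buf is too small.
static int make_build_options(char *buf, size_t len, int vec_width)
{
    // OpenCL C vloadN/vstoreN used here exist for these widths.
    if (vec_width != 2 && vec_width != 4 && vec_width != 8 && vec_width != 16)
        return -1;
    int n = snprintf(buf, len, "-D vectype=float%d -D VECWIDTH=%d",
                     vec_width, vec_width);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

// Hypothetical helper: number of work-items to launch along X when each
// work-item processes vec_width elements. Returns 0 if the region width is
// not divisible, in which case the caller needs a scalar tail path.
static size_t global_size_x(size_t region_width, size_t vec_width)
{
    if (vec_width == 0 || region_width % vec_width != 0)
        return 0;
    return region_width / vec_width;
}
```

For example, calling make_build_options(opts, sizeof opts, 8) fills opts with "-D vectype=float8 -D VECWIDTH=8", and global_size_x(1024, 8) gives 128 work-items along X. Which width is fastest on a given device is something to measure rather than assume.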

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow