Question

I am trying to exploit the 512-bit SIMD offered by KNC (Xeon Phi) to improve the performance of the C code below using Intel intrinsics. However, my intrinsic-based code runs slower than the auto-vectorized code.

C Code

int64_t match = 0;
int *myArray __attribute__((aligned(64)));
myArray = (int*) malloc(sizeof(int) * SIZE); // SIZE is the array size, taken from the user
randomize(myArray); // fill with random data
int searchVal = 24;
#pragma vector always
for (int i = 0; i < SIZE; i++) {
    if (myArray[i] == searchVal) match++;
}
return match;

Intrinsic-based code: in the code below I first load the array and compare it with the search key. The compare intrinsic returns a 16-bit mask, which I reduce using _mm512_mask_reduce_add_epi32().

register int64_t match = 0;
int *myArray __attribute__((aligned(64)));
myArray = (int*) malloc(sizeof(int) * SIZE); // SIZE is the array size, taken from the user
const int values[16] __attribute__((aligned(64))) =  // aligned load below requires 64-byte alignment
                {   1, 1, 1, 1,
                    1, 1, 1, 1,
                    1, 1, 1, 1,
                    1, 1, 1, 1 };
__m512i const flag = _mm512_load_epi32((void*) values);
__mmask16 countMask;

__m512i searchVal = _mm512_set1_epi32(24); // must match the key used in the scalar code
__m512i kV = _mm512_setzero_epi32();


for (int i = 0; i < SIZE; i += 16)
{
   // kV = _mm512_setzero_epi32();
    kV = _mm512_loadunpacklo_epi32(kV, (void*) &myArray[i]);
    kV = _mm512_loadunpackhi_epi32(kV, (void*) &myArray[i + 16]);

    countMask = _mm512_cmpeq_epi32_mask(kV, searchVal);
    match += _mm512_mask_reduce_add_epi32(countMask, flag);
}
return match;

I believe I have somehow introduced extra cycles in this code, and hence it runs slower than the auto-vectorized code. Unlike SIMD128, where the compare returns its result directly in a 128-bit register, SIMD512 returns the result in a mask register, which adds complexity to my code. Am I missing something here? There must be a way to compare and keep count of successful matches directly, rather than going through mask operations.

Finally, please suggest ways to increase the performance of this code using intrinsics. I believe I can extract more performance with intrinsics; this was at least true for SIMD128, where using intrinsics gained me 25% in performance.


Solution

I suggest the following optimizations:

  • Use prefetching. Your code performs very little computation and is almost certainly bandwidth-bound. Xeon Phi has hardware prefetching only for the L2 cache, so for optimal performance you need to insert prefetch instructions manually.
  • Use the aligned load _mm512_load_epi32, as hinted by @PaulR. Use the memalign function instead of malloc to guarantee that the array really is aligned on 64 bytes. And in case you ever need misaligned loads, use _mm512_undefined_epi32() as the source for the first misaligned load: it breaks the dependency on kV (in your current code) and lets the compiler do additional optimizations.
  • Unroll the loop by 2, or use at least two threads, to hide instruction latency.
  • Avoid using an int variable as the index. unsigned int, size_t, or ssize_t are better options.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow