Question

This is my first time asking a question on Stack Overflow. Sorry if my question does not fit the forum's style or size; I will improve with experience.

I am trying to vectorize a loop in C++ with Intel Compiler 14.0.1 to make better use of the wide 512-bit registers on Intel Xeon Phi. I was inspired by https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization and by numerous references found via Google stating that data alignment matters much more on Xeon Phi than on modern Xeon processors, where it is still important (one of them is a nice overview, https://indico.cern.ch/event/238763/material/slides/6.pdf, p. 18).

This question is somewhat similar to unaligned memory accesses, but covers a simpler and more widespread example, and hopefully has a more definitive answer.

An example piece of code:

#include <malloc.h>
#include <math.h> // for fmax


void func(float *const y, float  *const x, const int & N, const float & a0, const float & a1, const float & a2, const float & a3)
{
    __assume(N%16 == 0); // aim is to let compiler know that there is no residual loop (not sure if it works as expected, though)

    int i;
#pragma simd // to assume no vector dependencies
#pragma loop_count min(16), avg(80), max(2048) // to let compiler know for which cases to optimize (not sure if it is beneficial)
//#pragma vector aligned // to let compiler know that all the arrays are aligned... but not in this case
    for (i = 0; i < N; i++)
    {
        y[i] = fmax(x[i + 1] * a0 + x[i] * a1, x[i] * a2 + a3);
    }

}

int main(){

...
//y and x are allocated with 64-byte alignment, e.g.

float * y = (float *)_aligned_malloc(int_sizeBytes_x_or_y + 64, 64); //+64 for padding to enable vectorisation without using mask on the residual loop
float * x = (float *)_aligned_malloc(int_sizeBytes_x_or_y + 64, 64);
...
//M = 160 to 2048, more often 160 (a multiple of 16 - floats per register)
for (int k = 0; k < M; k++)
{
...
//int N = ceil(k / 16.0) * 16; // to have no residual loop, not sure if beneficial
...


func(y, x, N, a0, a1, a2, a3);


...
}
...
_aligned_free(x);
_aligned_free(y);
}

func() is called 150 to 2000 times in the body, re-using the pre-allocated space for x and y (to avoid repeated memory allocations, which are presumably relatively more expensive on Phi than on a regular Xeon). The body itself is repeated millions of times on each core.

The problem is that x[i] and x[i + 1] cannot both be aligned for the 512-bit vector engine, so vectorization is sub-optimal due to the misaligned memory access for the x[i + 1] part.

Would there be any speed benefit to pre-allocating a 64-byte-aligned _x once before the k loop, and filling it with the forward values of x via memcpy on every iteration of the k loop (the equivalent of for (int j = 0; j < N; j++) _x[j] = x[j + 1]; done in one memcpy), so that #pragma vector aligned can be used inside func() with y[i] = fmax(_x[i] * a0 + x[i] * a1, x[i] * a2 + a3);?
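The shifted copy being asked about could be sketched as follows (a hypothetical helper, assuming `_x` has been pre-allocated once with the same 64-byte alignment and padding as `x`):

```cpp
#include <cstring>

// Sketch of the proposed shifted copy: _x[j] = x[j + 1] for j = 0..N-1,
// done with a single memcpy instead of a loop. _x must be pre-allocated,
// 64-byte aligned, with capacity for at least N floats.
void make_shifted_copy(float *const _x, const float *const x, const int N)
{
    std::memcpy(_x, x + 1, N * sizeof(float)); // copies x[1..N] into _x[0..N-1]
}
```

After this, both `_x` and `x` start on 64-byte boundaries, so every access in the loop body would be aligned.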

Is there some nice approach to handle this fairly common problem efficiently, to make the best use of the vector engine?

Any suggestions on how to optimize vectorization for wide-register processors in general are also very welcome (this seems to be becoming quite an interesting topic, given Intel's recent push to enhance data parallelism as well as task parallelism).


The solution

Even in this case, it is good to let the compiler know that the arrays are aligned, as in: __assume_aligned(x, 64); __assume_aligned(y, 64);
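Placed at the top of func(), this would look roughly like the sketch below. The ASSUME_ALIGNED macro name is an assumption of this sketch; __assume_aligned is Intel-compiler-specific, so it is guarded here so the code also compiles elsewhere:

```cpp
#include <cmath>

// Sketch: telling the compiler that y and x are 64-byte aligned inside
// func(). __assume_aligned is an Intel compiler extension; the fallback
// branch (an assumption of this sketch) makes the macro a no-op on
// other compilers.
#ifdef __INTEL_COMPILER
#define ASSUME_ALIGNED(p) __assume_aligned((p), 64)
#else
#define ASSUME_ALIGNED(p) ((void)0)
#endif

void func(float *const y, float *const x, const int &N,
          const float &a0, const float &a1, const float &a2, const float &a3)
{
    ASSUME_ALIGNED(x);
    ASSUME_ALIGNED(y);
    for (int i = 0; i < N; i++)
        y[i] = std::fmax(x[i + 1] * a0 + x[i] * a1, x[i] * a2 + a3);
}
```

Note that the alignment hint only covers the base pointers; the x[i + 1] access remains misaligned, which is exactly the issue raised in the question.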

As for the __assume(N % 16 == 0): this can sometimes help, but you will see it used most often in codes that have an inner and an outer loop. The cost of the residual loop that is generated when N % 16 is not 0 is minor if you only hit it once. In this case, however, you are calling the function repeatedly, so it might help for larger values of M.
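The rounding the question sketches as ceil(k / 16.0) * 16 can be done with integer arithmetic; a minimal sketch (the helper name is hypothetical):

```cpp
// Sketch: round the trip count up to the next multiple of 16 (one full
// 512-bit register of floats) so that no residual loop is generated.
// The tail elements must exist (the +64-byte padding in the allocation
// covers them) and their results are simply discarded.
inline int round_up_16(const int n)
{
    return (n + 15) & ~15; // integer equivalent of ceil(n / 16.0) * 16
}
```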

It would not be a good idea to allocate a second array and populate it with the values starting at x[1]. The memcpy is too expensive compared to a slightly unaligned memory access.

You could try rewriting your code to use the _mm512_alignr_epi32 intrinsic. I tried to find a good example to point you to, but I have not found one yet. Using _mm512_alignr_epi32 might not gain you much in this case anyway, where you are only using the two vectors.
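For illustration only (this is a scalar emulation, not the intrinsic itself), _mm512_alignr_epi32(a, b, count) concatenates a and b with b in the low lanes, shifts the concatenation right by count 32-bit elements, and keeps the low 16 lanes. With count == 1, two aligned loads of x[i..i+15] and x[i+16..i+31] yield the unaligned vector x[i+1..i+16]:

```cpp
#include <array>

// Scalar emulation of the _mm512_alignr_epi32 semantics, shown with
// floats for readability (the intrinsic operates on 32-bit lanes).
// result[lane] = concat(a:b)[lane + count], where concat[0..15] = b
// (low) and concat[16..31] = a (high).
std::array<float, 16> alignr_emulated(const std::array<float, 16> &a,
                                      const std::array<float, 16> &b,
                                      const int count)
{
    std::array<float, 16> r{};
    for (int lane = 0; lane < 16; ++lane) {
        const int src = lane + count;            // index into the 32-lane concat
        r[lane] = (src < 16) ? b[src] : a[src - 16];
    }
    return r;
}
```

The idea is that two aligned loads plus one register shuffle replace one unaligned load, which is the usual way this access pattern is handled on the 512-bit vector engine.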

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow