This operation is called a horizontal sum. Say you have a vector v = {x0,x1,x2,x3,x4,x5,x6,x7}. First, extract the low/high 128-bit halves so you have w1 = {x0,x1,x2,x3} and w2 = {x4,x5,x6,x7}. Now call _mm_hadd_ps(w1, w2), which gives tmp1 = {x0+x1, x2+x3, x4+x5, x6+x7}. Calling _mm_hadd_ps(tmp1, tmp1) then gives tmp2 = {x0+x1+x2+x3, x4+x5+x6+x7, ...}. One last _mm_hadd_ps(tmp2, tmp2) gives tmp3 = {x0+x1+x2+x3+x4+x5+x6+x7, ...}. You could also replace the first _mm_hadd_ps with a simple _mm_add_ps.
This is all untested and written from the docs, and I make no promises about speed either...
Someone on the Intel forum shows another variant (look for HsumAvxFlt).
We can also look at what gcc suggests by compiling this code with gcc test.c -Ofast -mavx2 -S:

float f(float *t) {
    t = (float *)__builtin_assume_aligned(t, 32);
    float r = 0;
    for (int i = 0; i < 8; i++)
        r += t[i];
    return r;
}
The generated test.s
contains:
vhaddps %ymm0, %ymm0, %ymm0
vhaddps %ymm0, %ymm0, %ymm1
vperm2f128 $1, %ymm1, %ymm1, %ymm0
vaddps %ymm1, %ymm0, %ymm0
I am a bit surprised the last instruction isn't vaddss, but I guess it doesn't matter much.