This operation is called a horizontal sum. Say you have a vector v = {x0,x1,x2,x3,x4,x5,x6,x7}. First, extract the low/high 128-bit halves so you have w1 = {x0,x1,x2,x3} and w2 = {x4,x5,x6,x7}. Now call _mm_hadd_ps(w1, w2), which gives tmp1 = {x0+x1, x2+x3, x4+x5, x6+x7}. Calling _mm_hadd_ps(tmp1, tmp1) then gives tmp2 = {x0+x1+x2+x3, x4+x5+x6+x7, ...}. One last _mm_hadd_ps(tmp2, tmp2) gives tmp3 = {x0+x1+x2+x3+x4+x5+x6+x7, ...}. You could also replace the first _mm_hadd_ps with a simple _mm_add_ps.
This is all untested and written from the docs, and I make no promises about speed either...
Someone on the Intel forum shows another variant (look for HsumAvxFlt).
We can also look at what gcc suggests by compiling this code with gcc test.c -Ofast -mavx2 -S:

float f(float *t) {
    t = (float *)__builtin_assume_aligned(t, 32);
    float r = 0;
    for (int i = 0; i < 8; i++)
        r += t[i];
    return r;
}
The generated test.s
contains:
vhaddps %ymm0, %ymm0, %ymm0
vhaddps %ymm0, %ymm0, %ymm1
vperm2f128 $1, %ymm1, %ymm1, %ymm0
vaddps %ymm1, %ymm0, %ymm0
I am a bit surprised the last instruction isn't vaddss, but I guess it doesn't matter much.