Question

I have a program that spends almost all of its time computing loops like

for(int j = 0; j < BIGNUMBER; j++)
    for(int i = 0; i < SMALLNUMBER; i++)
        result += var[i] / sqrt((A[i].x-B[j].x)*(A[i].x-B[j].x)+(A[i].y-B[j].y)*(A[i].y-B[j].y)+(A[i].z-B[j].z)*(A[i].z-B[j].z));

Here 1.0/sqrt(...) computes the inverse of the norm of the difference between the two vectors A[i] = {A[i].x, A[i].y, A[i].z} and B[j] = {B[j].x, B[j].y, B[j].z}, and is also the most costly part of the loop.

Is there any way to optimize the loop, even with some precision loss?


Update:

Here is the assembly code of one unvectorized loop iteration, annotated with the worst-case latency of each instruction. You can clearly see that the inverse square root is the bottleneck:

movsd   A(%rip), %xmm0      1
movsd   A+8(%rip), %xmm2    1
subsd   B(%rip), %xmm0      3
subsd   B+8(%rip), %xmm2    3
movsd   A+16(%rip), %xmm1   1
mulsd   %xmm0, %xmm0        5
subsd   B+16(%rip), %xmm1   3
mulsd   %xmm2, %xmm2        5
mulsd   %xmm1, %xmm1        5
addsd   %xmm2, %xmm0        3
addsd   %xmm1, %xmm0        3
movsd   .LC0(%rip), %xmm1   1
unpcklpd    %xmm0, %xmm0    1
cvtpd2ps    %xmm0, %xmm0    4
unpcklps    %xmm0, %xmm0    3
cvtps2pd    %xmm0, %xmm0    2
sqrtsd  %xmm0, %xmm0        58
divsd   %xmm0, %xmm1        32
mulsd   var(%rip), %xmm1    5
addsd   result(%rip), %xmm1 3
cvttsd2si   %xmm1, %eax     3
movsd   %xmm1, result(%rip) 1

(By the way, I don't understand why it is doing unpcklpd cvtpd2ps unpcklps cvtps2pd.)


Solution

If you can arrange your vectors in AoSoA form (xxyyzzxxyyzz... for SSE2, xxxxyyyyzzzz... for AVX) you can do this very efficiently with SSE or AVX. In the code below I assumed SSE2, which has vec_size = 2, but it's easy to change this to AVX. However, your code is likely memory bound rather than compute bound, so this is only going to be useful for small loops which fit in the L1 cache. It will also be faster to use single floats, since they give twice the number of flops per register, and sqrt is one of the few functions which is actually slower for double than for float.

#include <pmmintrin.h> // SSE2 intrinsics, plus SSE3 for _mm_hadd_pd

// caveat: each lane pairs A[i+k] with B[j+k]; to reproduce the scalar loop's
// full i-by-j pairing, broadcast each B[j] across lanes with _mm_set1_pd instead
const int vec_size = 2;             // two doubles per SSE register
__m128d resultv = _mm_setzero_pd(); // takes no argument
for(int j = 0; j < BIGNUMBER; j+=vec_size) {
    __m128d bx = _mm_load_pd(&B[3*j+0*vec_size]);
    __m128d by = _mm_load_pd(&B[3*j+1*vec_size]);
    __m128d bz = _mm_load_pd(&B[3*j+2*vec_size]);
    for(int i = 0; i < SMALLNUMBER; i+=vec_size) {
        __m128d ax = _mm_load_pd(&A[3*i+0*vec_size]);
        __m128d ay = _mm_load_pd(&A[3*i+1*vec_size]);
        __m128d az = _mm_load_pd(&A[3*i+2*vec_size]);
        __m128d dx = _mm_sub_pd(ax,bx);
        __m128d dy = _mm_sub_pd(ay,by);
        __m128d dz = _mm_sub_pd(az,bz);
        __m128d mag2 = _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx,dx),_mm_mul_pd(dy,dy)), _mm_mul_pd(dz,dz));
        __m128d varv = _mm_load_pd(&var[i]);
        resultv = _mm_add_pd(_mm_div_pd(varv, _mm_sqrt_pd(mag2)), resultv);
        // with floats, the approximate _mm_rsqrt_ps could replace the div+sqrt;
        // note there is no _mm_rsqrt_pd for doubles
    }
}
result = _mm_cvtsd_f64(_mm_hadd_pd(resultv,resultv)); // horizontal sum (SSE3)
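
For reference, here is a minimal AVX sketch of the same kernel (not from the original answer): it assumes vec_size = 4, an xxxxyyyyzzzz... layout, 32-byte-aligned arrays, loop counts divisible by 4, and it carries the same lane-pairing caveat as above.

#include <immintrin.h>

__m256d resultv4 = _mm256_setzero_pd();
for(int j = 0; j < BIGNUMBER; j+=4) {
    __m256d bx = _mm256_load_pd(&B[3*j+0]);
    __m256d by = _mm256_load_pd(&B[3*j+4]);
    __m256d bz = _mm256_load_pd(&B[3*j+8]);
    for(int i = 0; i < SMALLNUMBER; i+=4) {
        __m256d dx = _mm256_sub_pd(_mm256_load_pd(&A[3*i+0]), bx);
        __m256d dy = _mm256_sub_pd(_mm256_load_pd(&A[3*i+4]), by);
        __m256d dz = _mm256_sub_pd(_mm256_load_pd(&A[3*i+8]), bz);
        __m256d mag2 = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(dx,dx), _mm256_mul_pd(dy,dy)), _mm256_mul_pd(dz,dz));
        resultv4 = _mm256_add_pd(_mm256_div_pd(_mm256_load_pd(&var[i]), _mm256_sqrt_pd(mag2)), resultv4);
    }
}
// reduce four lanes: add the high 128 bits onto the low, then one horizontal add
__m128d lo = _mm256_castpd256_pd128(resultv4);
__m128d hi = _mm256_extractf128_pd(resultv4, 1);
__m128d s = _mm_add_pd(lo, hi);
result = _mm_cvtsd_f64(_mm_hadd_pd(s, s));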

OTHER TIPS

It is tempting to assume sqrt is the "bottleneck", but if you take stack samples you can find out for sure. It could well be that

  (A[i].x-B[j].x)*(A[i].x-B[j].x)
+ (A[i].y-B[j].y)*(A[i].y-B[j].y)
+ (A[i].z-B[j].z)*(A[i].z-B[j].z)

is the bottleneck instead.

If you are depending on the compiler to optimize all those indexing operations, it might, but I like to be sure. As a first cut, I would write this in the inner loop:

// where Foo is the class of A[i] and B[j]
Foo& a = A[i];
Foo& b = B[j];
double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
result += var[i] / sqrt( dx*dx + dy*dy + dz*dz );

which is much easier for the compiler to optimize. Then if indexing is still a bottleneck, I would lift Foo& b = B[j] out of the inner loop, and step a pointer rather than writing A[i]. If it gets to the point where samples show that the inner for statement itself (testing for termination and incrementing) takes a noticeable fraction of time, I would unroll it some.
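
As a sketch, those last steps (invariant hoisted, stepped pointer; the unrolling is left out) could look like this:

for(int j = 0; j < BIGNUMBER; j++){
    const Foo& b = B[j];       // hoisted: invariant across the inner loop
    const Foo* pa = &A[0];     // stepped pointer instead of indexing A[i]
    for(int i = 0; i < SMALLNUMBER; i++, pa++){
        double dx = pa->x - b.x, dy = pa->y - b.y, dz = pa->z - b.z;
        result += var[i] / sqrt( dx*dx + dy*dy + dz*dz );
    }
}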

In a comment you said the bottleneck is still the computation of the inverse square root. Luckily for you, that is something that comes up a lot in graphics, and there is a very fancy algorithm for it: the fast inverse square root. There is a Wikipedia article about it, an implementation in Quake, and a SO question about it.
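
The well-known Quake III variant looks roughly like the sketch below (using memcpy for the type pun instead of the original's pointer cast, and folding the half-x into the refinement step):

#include <stdint.h>
#include <string.h>

float fast_rsqrt(float x) {
    uint32_t i;
    float y = x;
    memcpy(&i, &y, sizeof i);          // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);         // the famous magic-constant initial guess
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y); // one Newton-Raphson refinement step
    return y;                          // approximates 1/sqrt(x)
}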

This is a good workload for SIMD (SSE) instructions. If your compiler supports auto-vectorization, turn that option on (and tune your data structure layout for real gain). If it supports intrinsics, you may use them.

Edit
The assembler code above doesn't exploit vectorization; it uses scalar SSE instructions. You may try to help the compiler a bit: make structures of (X,Y,Z,Dummy) of float, not double, as sketched below. SSE vector instructions can process 4 floats simultaneously (or 2 doubles).
(I think that some math libraries already contain SIMD functions for vector norm.)
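
A minimal sketch of that padded layout (the name and the GCC alignment attribute are illustrative):

// padded to 16 bytes so each element fills one SSE register exactly
struct Vec3f {
    float x, y, z, dummy;
} __attribute__((aligned(16)));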

P.S. You may want to add the SSE tag to the question.

By forcing the calculation into single precision with (1.0f/sqrt(float(...))) and using #pragma GCC optimize ("Ofast") for the function, I was able to get the rsqrtss instruction, with a nice speedup (around 2 times faster) on the whole function. It actually breaks the auto-vectorization (probably because of the mix of single and double precision).
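
In source form the change looks roughly like this (a sketch; compute is a hypothetical wrapper, and the pragma applies Ofast to the functions that follow it):

#pragma GCC optimize ("Ofast")
void compute() {
    for(int j = 0; j < BIGNUMBER; j++)
        for(int i = 0; i < SMALLNUMBER; i++) {
            double dx = A[i].x-B[j].x, dy = A[i].y-B[j].y, dz = A[i].z-B[j].z;
            // the float cast is what lets GCC emit rsqrtss instead of sqrtsd+divsd
            result += var[i] * (1.0f/sqrt(float(dx*dx+dy*dy+dz*dz)));
        }
}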

Assembly code:

movsd   56(%rax), %xmm0     
addq    $120, %rax
movsd   -72(%rax), %xmm2    
subsd   %xmm5, %xmm0
movsd   -56(%rax), %xmm1    
subsd   %xmm4, %xmm2
subsd   %xmm6, %xmm1
mulsd   %xmm0, %xmm0
mulsd   %xmm2, %xmm2
mulsd   %xmm1, %xmm1
addsd   %xmm2, %xmm0
addsd   %xmm1, %xmm0
unpcklpd    %xmm0, %xmm0
cvtpd2ps    %xmm0, %xmm0
rsqrtss %xmm0, %xmm1
mulss   %xmm1, %xmm0
mulss   %xmm1, %xmm0
mulss   %xmm7, %xmm1
addss   %xmm8, %xmm0
mulss   %xmm1, %xmm0
mulss   -40(%rax), %xmm0
cmpq    %rdx, %rax
unpcklps    %xmm0, %xmm0
cvtps2pd    %xmm0, %xmm0
addsd   %xmm0, %xmm3

But I don't understand the additional multiplications at the end.
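
(My best guess from the constants involved: rsqrtss only yields a ~12-bit estimate, so under Ofast GCC appends one Newton-Raphson refinement step, in a factored form that matches the mulss/addss sequence above:)

// y0 = rsqrtss(x); the refinement GCC seems to emit, with -0.5f and -3.0f in registers:
float refine(float x, float y0) {
    return (x*y0*y0 - 3.0f) * (y0 * -0.5f); // == y0 * (1.5f - 0.5f*x*y0*y0)
}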

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow