Question

I have some code in a loop:

for(int i = 0; i < n; i++)
{
  u[i] = c * u[i] + s * b[i];
}

Here, u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization with SSE to get a speedup?

UPDATE

I learned vectorization (turns out it's not that hard if you use intrinsics) and implemented my loop in SSE. However, when I set the SSE2 flag in the VC++ compiler, I get about the same performance as with my own SSE code. The Intel compiler, on the other hand, was much faster than my SSE code or the VC++ compiler.

Here is the code I wrote, for reference (omegaCache here plays the role of b above):

#include <malloc.h>      // _aligned_malloc (MSVC)
#include <emmintrin.h>   // SSE2 intrinsics

double *u = (double*) _aligned_malloc(n * sizeof(double), 16);
for(int i = 0; i < n; i++)
{
   u[i] = 0;
}

int j = 0;
__m128d *uSSE = (__m128d*) u;        // safe: u is 16-byte aligned
__m128d cStore = _mm_set1_pd(c);     // broadcast c into both lanes
__m128d sStore = _mm_set1_pd(s);     // broadcast s into both lanes
for (j = 0; j <= n - 2; j += 2)
{
  __m128d uStore     = _mm_set_pd(u[j+1], u[j]);
  __m128d omegaStore = _mm_set_pd(omegaCache[j+1], omegaCache[j]);

  __m128d cu = _mm_mul_pd(cStore, uStore);
  __m128d so = _mm_mul_pd(sStore, omegaStore);

  uSSE[j/2] = _mm_add_pd(cu, so);
}
// scalar cleanup for the last element when n is odd
for(; j < n; ++j)
{
  u[j] = c * u[j] + s * omegaCache[j];
}

Solution

Yes, this is an excellent candidate for vectorization. But before you do so, make sure you've profiled your code to be sure that this is actually worth optimizing. That said, the vectorization would go something like this (shown here with SSE2 intrinsics):

__m128d vc = _mm_set1_pd(c);   // { c, c }
__m128d vs = _mm_set1_pd(s);   // { s, s }

int i;
for(i = 0; i < n - 3; i += 4)
{
  // load elements u[i,i+1,i+2,i+3] and b[i,i+1,i+2,i+3]
  // (SSE2 doubles are 2-wide, so that is two registers per array;
  //  use _mm_load_pd instead if the arrays are 16-byte aligned)
  __m128d u0 = _mm_loadu_pd(&u[i]);
  __m128d u1 = _mm_loadu_pd(&u[i+2]);
  __m128d b0 = _mm_loadu_pd(&b[i]);
  __m128d b1 = _mm_loadu_pd(&b[i+2]);
  // vector multiply u * c and s * b, add the partial results
  u0 = _mm_add_pd(_mm_mul_pd(vc, u0), _mm_mul_pd(vs, b0));
  u1 = _mm_add_pd(_mm_mul_pd(vc, u1), _mm_mul_pd(vs, b1));
  // store back to u[i,i+1,i+2,i+3]
  _mm_storeu_pd(&u[i], u0);
  _mm_storeu_pd(&u[i+2], u1);
}

// Finish up the uneven edge cases (or skip if you know n is a multiple of 4)
for( ; i < n; i++)
  u[i] = c * u[i] + s * b[i];

For even more performance, you can consider prefetching further array elements, and/or unrolling the loop and using software pipelining to interleave the computation in one iteration with the memory accesses from a different iteration.
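For instance, the prefetching part might look like this on top of the loop above (a sketch only; the distance of 64 doubles, i.e. 512 bytes ahead, is an assumption that has to be tuned per machine):

#include <xmmintrin.h>  // _mm_prefetch

for(i = 0; i < n - 3; i += 4)
{
  // hint the caches about data we will need in a later iteration;
  // prefetching past the end of the array is harmless
  _mm_prefetch((const char*)&u[i + 64], _MM_HINT_T0);
  _mm_prefetch((const char*)&b[i + 64], _MM_HINT_T0);

  __m128d u0 = _mm_loadu_pd(&u[i]), u1 = _mm_loadu_pd(&u[i+2]);
  __m128d b0 = _mm_loadu_pd(&b[i]), b1 = _mm_loadu_pd(&b[i+2]);
  _mm_storeu_pd(&u[i],   _mm_add_pd(_mm_mul_pd(vc, u0), _mm_mul_pd(vs, b0)));
  _mm_storeu_pd(&u[i+2], _mm_add_pd(_mm_mul_pd(vc, u1), _mm_mul_pd(vs, b1)));
}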

Other tips

Probably yes, but you have to help the compiler with some hints. __restrict__ placed on the pointers tells the compiler that the two pointers do not alias. If you know the alignment of your vectors, communicate that to the compiler as well (Visual C++ may have some facility for this).
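For example, a minimal sketch of the aliasing hint (the function name axpy_like is hypothetical; MSVC spells the keyword __restrict, GCC/Clang accept __restrict__):

// Restrict-qualified pointers promise the compiler that u and b never
// alias, which is exactly what an auto-vectorizer needs to know.
void axpy_like(double * __restrict u, const double * __restrict b,
               double c, double s, int n)
{
    for(int i = 0; i < n; i++)
        u[i] = c * u[i] + s * b[i];
}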

I am not familiar with Visual C++ myself, but I have heard it is not great at vectorization. Consider using the Intel compiler instead. Intel allows pretty fine-grained control over the generated assembly: http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm
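From the linked documentation, a loop-level hint would look roughly like this (Intel C/C++ only; treat this as a sketch, and note that the aligned clause makes the compiler assume 16-byte alignment, so it will crash at runtime if that promise is false):

// Intel C/C++ only: assert alignment so the vectorizer emits aligned
// loads/stores instead of a runtime alignment check.
#pragma vector aligned
for(int i = 0; i < n; i++)
    u[i] = c * u[i] + s * b[i];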

_mm_set_pd is not vectorized. If taken literally, it reads the two doubles using scalar operations, then combines the two scalar doubles and copies them into the SSE register. Use _mm_load_pd instead.
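Concretely, in the loop from the update, that swap would be (assuming u stays 16-byte aligned, as it is when it comes from _aligned_malloc with alignment 16):

// Before: two scalar loads plus a shuffle to build the register:
//   __m128d uStore = _mm_set_pd(u[j+1], u[j]);
// After: a single aligned 128-bit load (movapd):
__m128d uStore = _mm_load_pd(&u[j]);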

Yes, this is a great candidate for vectorization, assuming there is no overlap between the u and b arrays. But the code is bound by memory access (loads/stores). Vectorization helps reduce the cycles per loop, but the instructions will stall due to cache misses on the u and b arrays. The Intel C/C++ compiler generates the following code with default flags for a Xeon x5500 processor. The compiler unrolls the loop by 8 and employs SIMD ADD (addpd) and SIMD MULTIPLY (mulpd) instructions, using the xmm[0-15] SIMD registers. In each cycle, the processor can issue two SIMD instructions, yielding 4-way scalar ILP, assuming you have the data ready in the registers.

Here u, b, c, and s are double precision (8 bytes each).

    ..B1.14:                        # Preds ..B1.12 ..B1.10
    movaps    %xmm1, %xmm3                                  #5.1
    unpcklpd  %xmm3, %xmm3                                  #5.1
    movaps    %xmm0, %xmm2                                  #6.12
    unpcklpd  %xmm2, %xmm2                                  #6.12
      # LOE rax rcx rbx rbp rsi rdi r8 r12 r13 r14 r15 xmm0 xmm1 xmm2 xmm3
    ..B1.15:     # Preds ..B1.15 ..B1.14
    movsd     (%rsi,%rcx,8), %xmm4                          #6.21
    movhpd    8(%rsi,%rcx,8), %xmm4                         #6.21
    mulpd     %xmm2, %xmm4                                  #6.21
    movaps    (%rdi,%rcx,8), %xmm5                          #6.12
    mulpd     %xmm3, %xmm5                                  #6.12
    addpd     %xmm4, %xmm5                                  #6.21
    movaps    16(%rdi,%rcx,8), %xmm7                        #6.12
    movaps    32(%rdi,%rcx,8), %xmm9                        #6.12
    movaps    48(%rdi,%rcx,8), %xmm11                       #6.12
    movaps    %xmm5, (%rdi,%rcx,8)                          #6.3
    mulpd     %xmm3, %xmm7                                  #6.12
    mulpd     %xmm3, %xmm9                                  #6.12
    mulpd     %xmm3, %xmm11                                 #6.12
    movsd     16(%rsi,%rcx,8), %xmm6                        #6.21
    movhpd    24(%rsi,%rcx,8), %xmm6                        #6.21
    mulpd     %xmm2, %xmm6                                  #6.21
    addpd     %xmm6, %xmm7                                  #6.21
    movaps    %xmm7, 16(%rdi,%rcx,8)                        #6.3
    movsd     32(%rsi,%rcx,8), %xmm8                        #6.21
    movhpd    40(%rsi,%rcx,8), %xmm8                        #6.21
    mulpd     %xmm2, %xmm8                                  #6.21
    addpd     %xmm8, %xmm9                                  #6.21
    movaps    %xmm9, 32(%rdi,%rcx,8)                        #6.3
    movsd     48(%rsi,%rcx,8), %xmm10                       #6.21
    movhpd    56(%rsi,%rcx,8), %xmm10                       #6.21
    mulpd     %xmm2, %xmm10                                 #6.21
    addpd     %xmm10, %xmm11                                #6.21
    movaps    %xmm11, 48(%rdi,%rcx,8)                       #6.3
    addq      $8, %rcx                                      #5.1
    cmpq      %r8, %rcx                                     #5.1
    jl        ..B1.15       # Prob 99%                      #5.1

It depends on how you placed u and b in memory. If the two memory blocks are far from each other, SSE wouldn't give much of a boost in this scenario.

It is suggested that the arrays u and b be laid out as AoS (array of structures) instead of SoA (structure of arrays), because then you can load both of them into a register in a single instruction.
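A sketch of what that interleaved layout could look like (the UB struct is hypothetical, the horizontal add needs SSE3, and note the trade-off: each register now yields only one result instead of two):

#include <pmmintrin.h>  // SSE3, for _mm_hadd_pd

typedef struct { double u; double b; } UB;   // u[i] and b[i] side by side

// inside the loop, one aligned 16-byte load fetches both operands:
__m128d cs   = _mm_set_pd(s, c);             // lanes = { c (low), s (high) }
__m128d pair = _mm_load_pd(&ub[i].u);        // lanes = { u[i], b[i] }
__m128d prod = _mm_mul_pd(cs, pair);         // { c*u[i], s*b[i] }
ub[i].u = _mm_cvtsd_f64(_mm_hadd_pd(prod, prod));  // c*u[i] + s*b[i]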

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow