Rather off-topic, but since you stressed performance:
Eigen's generated assembly isn't always optimal - there is some overhead from poor register reuse and from writing results back to memory (this is not to blame Eigen by any means - doing this in generic templates is a near-impossible task).
If your kernels are fairly simple (QCD?), I would write them by hand using intrinsics.
Here is your classic kernel rewritten with intrinsics. It is faster than the Eigen version, and the timing is the same for Map and Matrix argument types (so you don't have to invent your own types).
EIGEN_DONT_INLINE void classic(double * __restrict__ VO, const double * __restrict__ AT, const double * __restrict__ VI) {
    __m128d vi01 = _mm_load_pd(VI + 0);
    __m128d vi23 = _mm_load_pd(VI + 2);
    // Two rows per iteration: VO[i] = sum_j AT[i*4 + j] * VI[j]
    for (int i = 0; i < 4; i += 2) {
        __m128d v00, v11;
        // row i
        {
            int ii = i*4;
            __m128d at01 = _mm_load_pd(&AT[ii + 0]);
            __m128d at23 = _mm_load_pd(&AT[ii + 2]);
            v00 = _mm_mul_pd(at01, vi01);
            v00 = _mm_add_pd(v00, _mm_mul_pd(at23, vi23));
        }
        // row i+1
        {
            int ii = i*4 + 4;
            __m128d at01 = _mm_load_pd(&AT[ii + 0]);
            __m128d at23 = _mm_load_pd(&AT[ii + 2]);
            v11 = _mm_mul_pd(at01, vi01);
            v11 = _mm_add_pd(v11, _mm_mul_pd(at23, vi23));
        }
        __m128d v = _mm_hadd_pd(v00, v11);
        // v is now [v00[0] + v00[1], v11[0] + v11[1]]
        _mm_store_pd(VO + i, v);
    }
}
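For reference, here is a plain-scalar version of what the kernel above computes (the function name is mine, not from the benchmark); it is handy as a correctness check against the intrinsics version:

```cpp
#include <cassert>

// Scalar equivalent of the intrinsics kernel above:
// VO[i] = sum_j AT[i*4 + j] * VI[j], with the 4x4 matrix's rows contiguous in AT.
static void classic_scalar(double* VO, const double* AT, const double* VI) {
    for (int i = 0; i < 4; ++i) {
        double s = 0.0;
        for (int j = 0; j < 4; ++j)
            s += AT[i * 4 + j] * VI[j];
        VO[i] = s;
    }
}
```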
You may gain some additional improvement by interleaving the loads with the mul/adds - I kept it simple here.
These are the results:
g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DCLASSIC_METHOD -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1
elapsed time: 611.397 ms
g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DCLASSIC_METHOD -DUSE_MAPS -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1
elapsed time: 615.541 ms
g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DUSE_MAPS -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1
elapsed time: 981.941 ms
g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1
elapsed time: 838.852 ms
On a further note, you could probably write a better SIMD kernel if your matrix were transposed - horizontal adds (_mm_hadd_pd) are expensive.
To add to the discussion in the comments: constructing the Eigen Maps inside the function removes the timing difference between Map and Matrix arguments.
EIGEN_DONT_INLINE void mapped(double (&VO)[4], const double (&AT)[4][4], const double (&VI)[4]) {
    Map<const Matrix4d, Aligned> A44(&AT[0][0]);
    Map<const Vector4d, Aligned> VIE(VI);
    Map<Vector4d, Aligned> VOE(VO);
    VOE.noalias() = A44.transpose() * VIE;
}
This is the top of the assembly when passing Maps to the function (function not inlined):
movq (%rsi), %rcx
movq (%rdx), %rax
movq (%rdi), %rdx
movapd (%rcx), %xmm0
movapd 16(%rcx), %xmm1
mulpd (%rax), %xmm0
mulpd 16(%rax), %xmm1
compared to passing array references (with the Maps constructed inside) or Matrix arguments:
movapd (%rsi), %xmm0
movapd 16(%rsi), %xmm1
mulpd (%rdx), %xmm0
mulpd 16(%rdx), %xmm1