Manually vectorized code 10x slower than auto optimized - what I did wrong?

Question

One of the main problems is that in your function dot_SSE you loop over n items when you should only loop over n/2 items (or n/4 with AVX).

To fix this with GCC's vector extensions you can do this:

double dot_double2(int n, double *a, double *b ) {
    typedef double double2 __attribute__ ((vector_size (16)));
    double2 sum2 = {};
    int i;
    double2* a2 = (double2*)a;
    double2* b2 = (double2*)b;
    for(i=0; i<n/2; i++) {
        sum2 += a2[i]*b2[i];
    }
    double dot = sum2[0] + sum2[1];
    for(i*=2;i<n; i++) dot +=a[i]*b[i]; 
    return dot;
}

The other problem with your code is that it has a dependency chain. Your CPU can do a simultaneous SSE addition and multiplication but only for independent data paths. To fix this you need to unroll the loop. The following code unrolls the loop by 2 (but you probably need to unroll by three for the best results).

double dot_double2_unroll2(int n, double *a, double *b ) {
    typedef double double2 __attribute__ ((vector_size (16)));
    double2 sum2_v1 = {};
    double2 sum2_v2 = {};
    int i;
    double2* a2 = (double2*)a;
    double2* b2 = (double2*)b;
    for(i=0; i<n/4; i++) {       
        sum2_v1 += a2[2*i+0]*b2[2*i+0];
        sum2_v2 += a2[2*i+1]*b2[2*i+1];
    }
    double dot = sum2_v1[0] + sum2_v1[1] + sum2_v2[0] + sum2_v2[1];
    for(i*=4;i<n; i++) dot +=a[i]*b[i]; 
    return dot;
}

Here is a version using double4 which I think is really what you wanted with your original dot_SSE function. It's ideal for AVX (though it still needs to be unrolled) but it will still work with SSE2 as well. In fact with SSE it seems GCC breaks it into two chains which effectively unrolls the loop by 2.

double dot_double4(int n, double *a, double *b ) {
    typedef double double4 __attribute__ ((vector_size (32)));
    double4 sum4 = {};
    int i;
    double4* a4 = (double4*)a;
    double4* b4 = (double4*)b;
    for(i=0; i<n/4; i++) {       
        sum4 += a4[i]*b4[i];
    }
    double dot = sum4[0] + sum4[1] + sum4[2] + sum4[3];
    for(i*=4;i<n; i++) dot +=a[i]*b[i]; 
    return dot;
}

If you compile this with FMA it will generate FMA3 instructions. I tested all these functions here (you can edit and compile the code yourself as well) http://coliru.stacked-crooked.com/a/273268902c76b116

Note that using SSE/AVX for a single dot production in matrix multiplication is not the optimal use of SIMD. You should do two (four) dot products at once with SSE (AVX) for double floating point.