Inline gcc assembly and local variables (double)

Question 1

The mistake here is in the use of movupd: this instruction always moves 128 bits (two doubles), both when loading and when storing.

By chance the first function gets away with copying 128 bits out, but the second one has only 64 bits of space in the ret variable. The 128-bit store therefore writes past ret and corrupts the stack, which is undefined behaviour.
Replacing movupd with movlpd (or movhpd), which move only 64 bits, makes things work like a charm.
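
To make the widths concrete, here is a minimal sketch using the SSE2 intrinsics that correspond roughly to these instructions. The helper name double_low_lane is hypothetical and not part of the original code, the compiler is free to pick different instructions than the comments suggest, and the #else branch is only an assumed fallback so the sketch builds on targets without SSE2.

```cpp
#include <cassert>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_loadu_pd, _mm_add_pd, _mm_storel_pd

// Hypothetical helper, not from the original code.
// Precondition: `in` points at two readable doubles (16 bytes).
double double_low_lane(const double *in) {
    assert(in != nullptr);
    __m128d v = _mm_loadu_pd(in);  // like movupd: reads 128 bits (both doubles)
    v = _mm_add_pd(v, v);          // like addpd: adds both lanes
    double ret;
    _mm_storel_pd(&ret, v);        // like movlpd: writes only the low 64 bits
    return ret;
}
#else
// Assumed portable fallback for targets without SSE2.
double double_low_lane(const double *in) {
    assert(in != nullptr);
    return in[0] + in[0];
}
#endif
```

The point is the last store: ret is 8 bytes wide, so only a 64-bit store fits into it, while a 128-bit store would need a 16-byte destination.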

Am I still clobbering the right registers?

The following code works fine when compiled with g++ -O3 -o asm_test asm_test.cpp:

  void my_func(const double *in, double *out) {
    asm ("mov %0, %%r8" : : "r"(in));
    asm ("movhpd (%%r8), %%xmm0" :);
    asm ("movhpd (%%r8), %%xmm1" :);
    asm ("addpd %%xmm1, %%xmm0" :);
    asm ("movhpd %%xmm0, (%0)" : : "r"(out) : "memory", "%r8", "%xmm0", "%xmm1");
  }

  double my_func2(const double *in) {
    double  ret;

    asm("mov %0, %%r8" : : "r"(in));
    asm("movlpd (%%r8), %%xmm0" :);
    asm("movlpd (%%r8), %%xmm1" :);
    asm("addpd %%xmm1, %%xmm0" :);
    asm("movlpd %%xmm0, %0" : "=m"(ret) : : "memory", "%r8", "%xmm0", "%xmm1");

    return ret;
  }

Question 2

gcc does not guarantee that register contents survive from one asm() statement to the next, so separate asm() statements that are not actually independent are fragile. It is better to write the above as a single statement per function:

#include <emmintrin.h> // SSE2 header that declares __m128d

static  void my_func(const double *in, double *out) {
    asm("movupd %1, %%xmm0\n"
        "movupd %1, %%xmm1\n"
        "addpd %%xmm1, %%xmm0\n"
        "movupd %%xmm0, %0"
        : "=xm"(*(__m128d*)out)  // SSE register or memory; movupd cannot use a general-purpose register
        : "xm"(*(__m128d*)in)
        : "%xmm0", "%xmm1");
}

static double my_func2(const double *in) {
    double ret;
    asm("movupd %1, %%xmm0\n"
        "movupd %1, %%xmm1\n"
        "addpd %%xmm1, %%xmm0\n"
        "movlpd %%xmm0, %0"
        : "=m"(ret)              // movlpd needs a memory operand here
        : "xm"(*(__m128d*)in)
        : "%xmm0", "%xmm1");
    return ret;
}

because this lets the compiler choose where to put things (mem or reg). For your source, this inlines the following two blocks into main():

  1c:   66 0f 10 44 24 10       movupd 0x10(%rsp),%xmm0
  22:   66 0f 10 4c 24 10       movupd 0x10(%rsp),%xmm1
  28:   66 0f 58 c1             addpd  %xmm1,%xmm0
  2c:   66 0f 11 44 24 20       movupd %xmm0,0x20(%rsp)
[ ... ]
  63:   66 0f 10 44 24 10       movupd 0x10(%rsp),%xmm0
  69:   66 0f 10 4c 24 10       movupd 0x10(%rsp),%xmm1
  6f:   66 0f 58 c1             addpd  %xmm1,%xmm0
  73:   66 0f 13 44 24 08       movlpd %xmm0,0x8(%rsp)

This is not optimal, though. If you change it to:

static  void my_func(const double *in, double *out) {
    asm volatile("movapd %1, %0\n"
                 "addpd %1, %0"
                 : "=x"(*(__m128d*)out)  // dereferenced; addpd needs a register destination
                 : "x"(*(__m128d*)in));
}

you leave it to the compiler to decide where the variables live. The compiler then detects that it can avoid the loads and stores altogether, and this gets inlined simply as:

  18:   66 0f 28 c1             movapd %xmm1,%xmm0
  1c:   66 0f 58 c1             addpd  %xmm1,%xmm0

since the compiler recognizes that it already has all inputs in registers and wants all results in registers.
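
The same idea can be sketched with a single read-write operand. This is only an illustration under stated assumptions: the helper name double_pair is hypothetical, the "+x" constraint asks gcc to keep the value in an SSE register of its choosing for both input and output, and the #else branch is an assumed fallback for targets without SSE2 or gcc-style asm.

```cpp
#include <cassert>

#if defined(__SSE2__) && defined(__GNUC__)
#include <emmintrin.h>  // __m128d, _mm_loadu_pd, _mm_storeu_pd

// Hypothetical helper, not from the original code.
// Precondition: `in` and `out` each point at two doubles (16 bytes).
void double_pair(const double *in, double *out) {
    assert(in != nullptr && out != nullptr);
    __m128d v = _mm_loadu_pd(in);   // unaligned 128-bit load, emitted by the compiler
    asm("addpd %0, %0" : "+x"(v));  // "+x": v is both read and written, in an SSE register
    _mm_storeu_pd(out, v);          // unaligned 128-bit store, emitted by the compiler
}
#else
// Assumed portable fallback for other targets.
void double_pair(const double *in, double *out) {
    assert(in != nullptr && out != nullptr);
    out[0] = in[0] + in[0];
    out[1] = in[1] + in[1];
}
#endif
```

No register is named in the asm text, so no clobber list is needed, and the loads and stores stay outside the asm where the optimizer can remove or merge them after inlining.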

It is not necessary to use assembly for this at all, though; with a decent compiler (your gcc will do) the plain C/C++ version,

static void my_func(const double *in, double *out) {
    out[0] = in[0] + in[0];
    out[1] = in[1] + in[1];
}

will most likely be compiled into code that is no less efficient.