Segfault on function termination

Question

Your problem is that inline assembly does not replace the function body. The compiler still wraps your code in its own prologue and epilogue. Your function compiles to this:

_foo:
 push   %rbp              ; function prologue
 mov    %rsp,%rbp
 mov    %rdi,-0x8(%rbp)
 mov    %rsi,-0x10(%rbp)
 mov    %edx,-0x14(%rbp)
 mov    -0x14(%rbp),%eax
 mov    %eax,-0x1c(%rbp)

 mov    -0x14(%rbp),%ecx  ; your code
 mov    -0x8(%rbp),%rax
 mov    -0x10(%rbp),%rdx
 sub    $0x4,%rsp
 movss  %xmm4,(%rsp)
 flds   (%rsp)
 add    $0x4,%rsp
 retq                     ; your return

 movss  -0x18(%rbp),%xmm0 ; function epilogue
 pop    %rbp
 retq                     ; gcc's return

retq pops a value off the stack and jumps to it. If everything goes right, that value is the return address pushed by callq. But gcc generated a function prologue (the first two instructions above) that includes push %rbp, so when your retq runs, the value on top of the stack is the saved %rbp (a pointer into the stack), and retq jumps there. This is probably causing the segmentation fault because the stack is not executable (if for some reason your stack is executable, it could instead be because %rax is an invalid pointer). The bytes it happened to point to are 00 00 (which show up a lot in memory, unsurprisingly) and coincidentally disassemble to add %al,(%rax).
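If you want to see that layout for yourself, here is a minimal sketch. It assumes GCC or Clang on x86-64 with the usual push %rbp / mov %rsp,%rbp prologue, and the function name is just something I made up for illustration. It reads the two stack slots at the frame pointer: the saved %rbp, and right above it the return address that callq pushed.

```c
#include <stdio.h>

/* Assumption: GCC or Clang on x86-64, standard frame-pointer prologue.
   The function name is hypothetical. */
__attribute__((noinline))
static int return_address_is_above_saved_rbp(void)
{
    /* With the standard prologue, the frame address is the value of %rbp. */
    void **frame = (void **)__builtin_frame_address(0);

    /* frame[0] holds the caller's %rbp (pushed by the prologue);
       frame[1] holds the return address (pushed by callq). */
    void *saved_rbp   = frame[0];
    void *return_addr = frame[1];

    printf("saved rbp   = %p\n", saved_rbp);
    printf("return addr = %p\n", return_addr);

    /* Compare against what the compiler reports as the return address. */
    return return_addr == __builtin_return_address(0);
}
```

A retq issued while the saved %rbp is still on top of the stack consumes frame[0] instead of frame[1], which is exactly the bad jump described above.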

Now, I'm new to SSE, and I've only used GCC inline assembly a handful of times, so I'm not sure if this is a working solution. You really shouldn't be looking at the stack, or returning, from inline assembly, because different compilers generate different function prologues, which changes the relative location of the arguments on the stack by the time your code runs.

Try something like:

#include <stdio.h>

float foo(float *x,float *y,unsigned int s)
{
    float result;

    __asm__ __volatile__(
    "movss  (%%rax),%%xmm4 \n\t"       // xmm4 = *x
    "movss  (%%rdx),%%xmm5 \n\t"       // xmm5 = *y
    "addss  %%xmm5,%%xmm4  \n\t"       // xmm4 += xmm5

    "movss  %%xmm4,(%%rbx) \n\t"       // result = xmm4
    :
    :"c"(s), "a"(x), "d"(y), "b"(&result)  // ecx = s, rax = x, rdx = y, rbx = &result
    :"memory", "cc", "xmm4", "xmm5"        // tell gcc that xmm4 and xmm5 are overwritten
    );

    return result;
}

int main() {
    float x = 1.0, y = 2.0;
    printf("%f", foo(&x, &y, 99));
    return 0;
}

All stack allocation, argument handling and returning is done in C. It also passes in a pointer for storing the float result.

This generates the following assembly, which is roughly what you were looking for:

_foo:
 push   %rbp              ; prologue
 mov    %rsp,%rbp
 push   %rbx

 lea    -0xc(%rbp),%rbx   ; set up registers
 mov    %edx,%ecx
 mov    %rdi,%rax
 mov    %rsi,%rdx

 movss  (%rax),%xmm4      ; your code
 movss  (%rdx),%xmm5
 addss  %xmm5,%xmm4
 movss  %xmm4,(%rbx)

 movss  -0xc(%rbp),%xmm0  ; retrieve result to xmm0 (the return register)

 pop    %rbx              ; epilogue
 pop    %rbp
 retq
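If you don't need specific registers, a variant of the same idea is to let the compiler pick them. This is only a sketch, assuming GCC or Clang targeting x86-64 with SSE; the function name add_first is mine, not something from your code. The "+x" constraint asks for a read-write operand in any xmm register, and "xm" allows either an xmm register or a memory operand, so no pointer to the result and no "memory" clobber are needed.

```c
/* Assumption: GCC or Clang on x86-64 with SSE.
   The function name add_first is hypothetical. */
static float add_first(const float *x, const float *y)
{
    float result = *x;        /* the compiler loads *x into an xmm register */

    __asm__("addss %1, %0"    /* result += *y  (AT&T order: source, destination) */
            : "+x"(result)    /* read-write operand, any xmm register */
            : "xm"(*y));      /* xmm register or memory operand */

    return result;            /* the compiler moves it to xmm0 if needed */
}
```

Called as add_first(&x, &y) with x = 1.0 and y = 2.0, this returns 3.0, the same as foo above. Because the operands are described by constraints, the compiler is free to inline the function and keep everything in registers.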

The other option is always to write it in an assembly file, and link that with your C code later.

I hope that was somewhat helpful, but I'm sorry if it didn't fully answer your question.

Edit: updated code to something that actually runs for me.