Your problem is that inline assembly does not replace the function. Your function compiles to this:
_foo:
push %rbp ; function prologue
mov %rsp,%rbp
mov %rdi,-0x8(%rbp)
mov %rsi,-0x10(%rbp)
mov %edx,-0x14(%rbp)
mov -0x14(%rbp),%eax
mov %eax,-0x1c(%rbp)
mov -0x14(%rbp),%ecx ; your code
mov -0x8(%rbp),%rax
mov -0x10(%rbp),%rdx
sub $0x4,%rsp
movss %xmm4,(%rsp)
flds (%rsp)
add $0x4,%rsp
retq ; your return
movss -0x18(%rbp),%xmm0 ; function epilogue
pop %rbp
retq ; gcc's return
retq pops a value off the stack and jumps to it. If everything goes right, that value is the return address pushed by callq. But gcc generated a function prologue (the first two instructions above) that includes push %rbp. So when your retq runs, it pops the saved rbp (a pointer into the stack) and jumps to it. This is probably what causes the segmentation fault, because the stack is not executable (if for some reason your stack is executable, the fault could instead come from %rax being an invalid pointer). The bytes on the stack that it happened to point to are 00 00 (which show up a lot in memory, unsurprisingly) and coincidentally disassemble to add %al,(%rax).
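To make the ordering concrete, here is a toy model of the stack in plain C. It is only an illustration, not real machine state: the names toy_stack, toy_push, toy_pop and premature_ret_target are made up for this sketch, and the addresses you pass in are arbitrary numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy LIFO model of the call stack; all names here are hypothetical. */
typedef struct {
    uint64_t slots[8];
    size_t   depth;     /* number of values currently pushed */
} toy_stack;

static void toy_push(toy_stack *s, uint64_t value)
{
    assert(s->depth < 8);
    s->slots[s->depth++] = value;
}

static uint64_t toy_pop(toy_stack *s)
{
    assert(s->depth > 0);
    return s->slots[--s->depth];
}

/* Returns the address a retq would jump to if it ran right after the prologue. */
uint64_t premature_ret_target(uint64_t return_address, uint64_t caller_rbp)
{
    toy_stack s = { {0}, 0 };
    toy_push(&s, return_address); /* callq pushes the return address */
    toy_push(&s, caller_rbp);     /* prologue: push %rbp */
    return toy_pop(&s);           /* retq pops the top slot: the saved rbp */
}
```

Whatever was pushed last is what retq consumes, so the return address is only reachable after the matching pop %rbp in the epilogue.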
Now, I'm new to SSE, and I've only used GCC inline assembly a handful of times, so I'm not sure if this is a working solution. You really shouldn't be looking at the stack, or returning, from inside inline assembly, because different compilers generate different function prologues, which changes the relative location of the arguments on the stack by the time your code runs.
Try something like:
#include <stdio.h>
float foo(float *x,float *y,unsigned int s)
{
float result;
__asm__ __volatile__(
"movss (%%rax),%%xmm4 \n\t" // xmm4 = *x
"movss (%%rdx),%%xmm5 \n\t" // xmm5 = *y
"addss %%xmm5,%%xmm4 \n\t" // xmm4 += xmm5
"movss %%xmm4,(%%rbx) \n\t" // result = xmm4
:
:"c"(s), "a"(x), "d"(y), "b"(&result) // ecx = s, rax = x, rdx = y, rbx = &result
:"memory", "cc", "xmm4", "xmm5" // the asm overwrites xmm4 and xmm5, so list them as clobbers
);
return result;
}
int main() {
float x = 1.0, y = 2.0;
printf("%f", foo(&x, &y, 99));
return 0;
}
All stack allocation, argument handling and returning is done in C. It also passes in a pointer for storing the float result.
This generates the following assembly, which is roughly what you were looking for:
_foo:
push %rbp ; prologue
mov %rsp,%rbp
push %rbx
lea -0xc(%rbp),%rbx ; set up registers
mov %edx,%ecx
mov %rdi,%rax
mov %rsi,%rdx
movss (%rax),%xmm4 ; your code
movss (%rdx),%xmm5
addss %xmm5,%xmm4
movss %xmm4,(%rbx)
movss -0xc(%rbp),%xmm0 ; retrieve result to xmm0 (the return register)
pop %rbx ; epilogue
pop %rbp
retq
The other option is always to write it in an assembly file, and link that with your C code later.
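If you stay with inline assembly, a possible variation is to let the compiler pick the registers through operand constraints instead of pinning rax, rdx and rbx. This is a minimal sketch, assuming GCC or Clang targeting x86-64 with SSE; the name foo_constrained is hypothetical, and the unused s argument is dropped for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of foo: the compiler chooses the registers. */
float foo_constrained(const float *x, const float *y)
{
    assert(x != NULL && y != NULL);

    float result = *x;                /* loaded by the compiler, not by the asm */

    __asm__("addss %1, %0"            /* result += *y */
            : "+x"(result)            /* read-write operand in any SSE register */
            : "xm"(*y));              /* SSE register or memory operand */

    return result;                    /* the compiler moves it to xmm0 as needed */
}
```

Because the only effect of the asm is on its declared output, no volatile qualifier, "memory" clobber, or explicit register clobbers should be needed here, and the compiler stays free to inline the function.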
I hope that was somewhat helpful, but I'm sorry if it didn't fully answer your question.
Edit: updated code to something that actually runs for me.