A quick test shows that in your example double4 arguments are passed on the stack but returned in registers xmm0 and xmm1. This is a bit weird. float4 arguments, on the other hand, are passed in registers xmm0 through xmm7 and results are returned in xmm0, as you would expect.
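If you want to repeat this kind of test yourself, a minimal sketch could look like the one below. The typedefs and the function names add_double4 and add_float4 are my own assumptions for illustration (your double4 and float4 may be defined differently); compile with something like `clang -O2 -S test.c` and look at how the arguments are read in the generated assembly.

```c
/* Assumption: double4 and float4 are declared with the vector_size
   attribute. The definitions in the question may differ. */
typedef double double4 __attribute__((vector_size(32)));
typedef float  float4  __attribute__((vector_size(16)));

/* noinline keeps a real call in the output, so the argument and
   return value passing stays visible in the assembly. */
__attribute__((noinline))
double4 add_double4(double4 a, double4 b)
{
    return a + b;   /* element-wise addition */
}

__attribute__((noinline))
float4 add_float4(float4 a, float4 b)
{
    return a + b;   /* element-wise addition */
}
```

Adding `-mavx` to the command line lets you compare the result with AVX enabled.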
Apple uses the System V Application Binary Interface, AMD64 Architecture Processor Supplement, for Mac OS X. If I interpret that document correctly, everything should be passed in registers. I am not sure what clang is doing here. Maybe this is still a work in progress and may change in the future? If it does, mixing code compiled with the old and the new behavior may break your program.
For performance, passing vectors by value with clang is not a problem. If your functions are not extremely short, there should be no noticeable difference. If you do use very small functions, you should try to convince the compiler to inline them (e.g. by declaring them static).
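As a rough sketch of the inlining advice, a small helper can be declared static inline so the compiler is free to fold it into its caller, in which case no vector argument has to be passed at all. The names scale4 and sum_scaled and the typedef are hypothetical and only serve as an illustration; whether the call is actually inlined is still up to the optimizer.

```c
/* Assumption: double4 is declared with the vector_size attribute. */
typedef double double4 __attribute__((vector_size(32)));

/* static gives the helper internal linkage; inline is a hint.
   At -O2 a function this small is a likely inlining candidate. */
static inline double4 scale4(double4 v, double s)
{
    double4 f = {s, s, s, s};   /* broadcast the scalar */
    return v * f;               /* element-wise multiplication */
}

/* Caller: once scale4 is inlined, no call (and so no argument
   passing) remains for it. */
double sum_scaled(double4 v, double s)
{
    double4 r = scale4(v, s);
    return r[0] + r[1] + r[2] + r[3];
}
```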
EDIT: Regarding AVX extensions: if you enable them, the compiler uses registers ymm0 to ymm7 for arguments and ymm0 for results. In that case a double4 occupies a single ymm register instead of an xmm register pair.