Why Is GCC Using Mov Instead Of Push In Function Calls?

Question 1

-mpush-args

Push instructions will be used to pass outgoing arguments when functions are called. Enabled by default.

-mno-push-args

Use PUSH operations to store outgoing parameters. This method is shorter and usually equally fast as method using SUB/MOV operations and is enabled by default. In some cases disabling it may improve performance because of improved scheduling and reduced dependencies.

-maccumulate-outgoing-args

If enabled, the maximum amount of space required for outgoing arguments will be computed in the function prologue. This is faster on most modern CPUs because of reduced dependencies, improved scheduling and reduced stack usage when preferred stack boundary is not equal to 2. The drawback is a notable increase in code size. This switch implies -mno-push-args.

Even though -mpush-args is enabled by default, it is overridden by -maccumulate-outgoing-args, which is also enabled by default. Compiling with -mno-accumulate-outgoing-args passed explicitly could change the instructions to push.
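One way to check this is to compile a small test case to assembly with each option and compare the call sequence. The sketch below is only an illustration: the names add3 and call_add3 are hypothetical and not from the question, -m32 is used so that every argument goes on the stack (it needs 32-bit multilib support installed), and the exact output depends on the GCC version and tuning.

```c
/* Hypothetical test case, not taken from the original question.
 *
 * Compare how call_add3 sets up its outgoing arguments:
 *   gcc -m32 -S -mno-accumulate-outgoing-args push_demo.c
 *   gcc -m32 -S -maccumulate-outgoing-args    push_demo.c
 *
 * With the first command one would expect push instructions before the
 * call; with the second, a sub on %esp in the prologue followed by mov
 * stores relative to %esp.
 */
int add3(int a, int b, int c)
{
    return a + b + c;
}

int call_add3(void)
{
    /* In 32-bit code all three arguments are passed on the stack. */
    return add3(1, 2, 3);
}
```

Looking only at the instructions between the prologue of call_add3 and the call itself is enough to see which method was chosen.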

Question 2

x86-64 System V passes the first 6 integer args in registers RDI, RSI, RDX, RCX, R8, R9. So in main we have mov $666, %edi (which zero-extends to the full RDI) to pass the 64-bit arg long john.

push can't write registers; nothing¹ can stop GCC from using mov to set registers, and you wouldn't want to. If you passed 7 or more args, GCC normally would use push in main to pass the 7th on the stack, because -mno-accumulate-outgoing-args is the default in modern GCC. push has been efficient on x86 since Pentium-M or so introduced a "stack engine" to track stack-pointer updates specially.
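To see the seventh argument go on the stack, something like the following can be compiled with gcc -S. This is a sketch under assumptions: sum7 and call_sum7 are made-up names, and the noinline attribute is there only to keep the call visible if optimization is enabled.

```c
/* Hypothetical example. In x86-64 System V, a..f travel in
 * RDI, RSI, RDX, RCX, R8, R9; the seventh integer argument g has
 * no register left and is passed on the stack. */
__attribute__((noinline))
long sum7(long a, long b, long c, long d, long e, long f, long g)
{
    return a + b + c + d + e + f + g;
}

long call_sum7(void)
{
    /* Inspect with: gcc -O1 -S seven_args.c
     * The first six constants are expected to be set with mov into
     * registers; the 7 is expected to reach the callee via the stack,
     * normally with a push in modern GCC. */
    return sum7(1, 2, 3, 4, 5, 6, 7);
}
```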

Sunil Bojanapally's answer covers those options, which are more relevant for 32-bit code where all args are passed on the stack. If you got here from searching on the title question, see that answer or "Why does gcc use movl instead of push to pass function args?". This answer is about the actual question, which is about what the callee does with its incoming arg in a debug build, not about how the arg is passed to it.

You're talking about the code inside the callee that stores that incoming arg to the stack. This isn't passing an arg, it's just a consequence of a debug build: at the default -O0 anti-optimization level, every C variable gets a memory address unless it is declared register. So compilers emit instructions to store incoming register args to the stack.

In this case movq %rdi, -8(%rbp) is storing to the red zone below RSP, since worship() is a leaf function. The stack space is already effectively reserved (down to -128(%rsp), and at this point RBP=RSP).

And just to be clear, this is not part of the function call. Spilling incoming args to the stack inside the callee only happens in a debug build; it is not part of the calling convention.
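A minimal reconstruction of the situation, assuming the callee simply returns 0. Only the names worship and john and the value 666 come from the question; the function body, the return type, and the call_worship wrapper are assumptions. The assembly in the comment is typical GCC -O0 output and may differ between versions.

```c
/* Assumed shape of the callee; compile with: gcc -O0 -S worship.c
 *
 * Typical unoptimized x86-64 output for worship:
 *     push %rbp
 *     mov  %rsp, %rbp
 *     movq %rdi, -8(%rbp)    # spill the incoming arg into the red zone
 *     mov  $0, %eax
 *     pop  %rbp
 *     ret
 *
 * The movq store is the callee giving john a memory address.
 * It is not part of passing the argument.
 */
int worship(long john)
{
    return 0;
}

/* Hypothetical caller: the argument itself is passed in a register,
 * with mov $666, %edi before the call. */
int call_worship(void)
{
    return worship(666);
}
```

Recompiling the same file with -O2 should make the store disappear, since the optimizer has no reason to keep john in memory.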

If it had needed to sub $16, %rsp / mov-store / leave, e.g. if you'd compiled with -mno-red-zone, then yes it could have been an optimization to do that spill with push %rdi. But existing compilers don't do that optimization for initializing + creating locals.

push %rdi in worship would have required the compiler to use leave instead of just pop %rbp, which is slightly more expensive. And it would only align the stack to RSP%16 == 8 after push %rbp aligned it to RSP%16 == 0; compilers prefer to keep the stack aligned by 16 even when they're not making further function calls.
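The alignment arithmetic can be checked with a tiny model. This is only a sketch: the helper name is made up, and it assumes the usual System V state on function entry, where the call instruction has just pushed an 8-byte return address onto a 16-byte aligned stack, so RSP%16 == 8.

```c
/* Hypothetical helper: RSP modulo 16 after a number of 8-byte pushes.
 * Each push subtracts 8 from RSP; modulo 16, subtracting 8 is the same
 * as adding 8, which keeps the arithmetic unsigned. */
unsigned rsp_mod16_after_pushes(unsigned rsp_mod16_at_entry, unsigned pushes)
{
    return (rsp_mod16_at_entry + 8u * pushes) % 16u;
}
```

With an entry value of 8, one push (push %rbp) gives 0 and a second push (push %rdi) gives 8 again, which is the misalignment described above.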

And of course if you'd just enabled optimization, worship would just be xor %eax,%eax / ret, not wasting instructions putting the register arg anywhere.

Footnote 1: -Oz (favour code-size without caring about speed) might use 3-byte push imm8 / pop rdi instead of 5-byte mov edi, imm32 to materialize a value in a register if it was in the -128..+127 range. But 666 isn't, so mov is also the smallest way to set a register to that value without any pre-existing known register values near that. (Code golf x86-64 machine code tips).
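The byte counts in the footnote can be written out as machine code. The array and function names below are made up for illustration, and the value 42 is just an arbitrary constant that fits in a signed 8-bit immediate.

```c
/* mov $666, %edi  ->  BF 9A 02 00 00
 * Opcode B8+7 selects EDI, followed by the 32-bit immediate 0x0000029A. */
const unsigned char mov_edi_666[] = { 0xBF, 0x9A, 0x02, 0x00, 0x00 };

/* push $42 ; pop %rdi  ->  6A 2A 5F
 * push imm8 is 6A ib (2 bytes), pop %rdi is 58+7 (1 byte). */
const unsigned char push_pop_rdi_42[] = { 0x6A, 0x2A, 0x5F };

/* Returns 1 if v fits the sign-extended 8-bit immediate of push imm8. */
int fits_imm8(long v)
{
    return v >= -128 && v <= 127;
}
```

Since fits_imm8(666) is false, the 3-byte push/pop form is not available for this constant, and the 5-byte mov is the short encoding.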

Question 3

Compilers like GCC are written by people who very carefully consider how to make frequently used code sequences (like function call/return) as efficient as possible. Their solutions target the general case, so in special cases there might be better options.