IL and arguments

Question 1

MSIL is defined against the specification of a virtual machine. The mental model for the arguments passed to a method is that they live in an array: Ldarg picks an element from that array and pushes it onto the evaluation stack. Opcodes.Ldarg_0 is an abbreviated form of the more general Opcodes.Ldarg IL instruction. It is a single byte because index 0 is built into the opcode, whereas the general form needs a two-byte opcode followed by a two-byte index. Same idea for Opcodes.Ldarg_1 for the 2nd argument. These short forms are very common, of course. Loading an argument only gets "expensive" when the method has more than 4 arguments, since the short forms exist only for indexes 0 through 3. Emphasis on the double quotes, this is not the kind of expense you ever worry about.
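
To make the size difference concrete, here is a small Python sketch that picks the shortest encoding for loading a given argument index. The opcode bytes are the ones listed in ECMA-335 Partition III; the function names `encode_ldarg` and `load_argument` are made up for this illustration and are not part of any .NET API.

```python
import struct

# Opcode bytes as listed in ECMA-335 Partition III.
LDARG_SHORT_FORMS = {0: 0x02, 1: 0x03, 2: 0x04, 3: 0x05}  # ldarg.0 .. ldarg.3
LDARG_S = 0x0E          # ldarg.s <uint8 index>
LDARG = (0xFE, 0x09)    # ldarg <uint16 index>, a two-byte opcode


def encode_ldarg(index: int) -> bytes:
    """Return the shortest IL encoding that loads argument `index` (hypothetical helper)."""
    if index < 0 or index > 0xFFFF:
        raise ValueError("the operand of ldarg is an unsigned 16-bit integer")
    if index in LDARG_SHORT_FORMS:
        return bytes([LDARG_SHORT_FORMS[index]])       # 1 byte
    if index <= 0xFF:
        return bytes([LDARG_S, index])                 # 2 bytes
    return bytes(LDARG) + struct.pack("<H", index)     # 4 bytes, little-endian index


def load_argument(args: list, index: int, eval_stack: list) -> None:
    """Model of what every ldarg variant does: copy args[index] onto the evaluation stack."""
    eval_stack.append(args[index])
```

Under this model, `encode_ldarg(0)` is one byte, `encode_ldarg(4)` is two bytes, and only indexes above 255 need the full four-byte form.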

The actual storage of arguments at runtime is very different. It depends on the jitter you use, since different architectures pass arguments in different ways. In general, the first few arguments are passed in CPU registers and the rest on the CPU stack. Processors like x64 or ARM have a lot of registers, so they pass more of the arguments in registers than x86 does. All of this is governed by the rules of the __clrcall calling convention for that architecture.
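
As a rough illustration of "first few in registers, the rest on the stack", the sketch below uses three well-known native conventions rather than the CLR's own rules, which I am not reproducing here. It only handles integer or pointer arguments, ignores floating-point values, structs and the shadow space, and the `stack[n]` labels are ordinal positions, not byte offsets. The function name `place_int_args` is hypothetical.

```python
# Integer/pointer argument registers for a few native calling conventions.
# These are NOT the CLR's internal rules; they only illustrate the general idea.
INT_ARG_REGISTERS = {
    "x86-cdecl": [],                                         # everything goes on the stack
    "x64-windows": ["rcx", "rdx", "r8", "r9"],
    "x64-sysv": ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],
}


def place_int_args(convention: str, arg_count: int) -> list:
    """Return where each of `arg_count` integer arguments is passed (simplified model)."""
    registers = INT_ARG_REGISTERS[convention]
    placement = []
    for i in range(arg_count):
        if i < len(registers):
            placement.append(registers[i])
        else:
            # Ordinal among the stack-passed arguments, not a byte offset.
            placement.append(f"stack[{i - len(registers)}]")
    return placement
```

For five integer arguments, this model puts everything on the stack for 32-bit cdecl, one argument on the stack for the Windows x64 convention, and none for System V x64.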

Question 2

The IL (now known as CIL, Common Intermediate Language, not MSIL) describes operations on an imaginary stack machine. The JIT compiler takes the IL instructions and compiles them into machine code.

When calling a method, the JIT compiler has to adhere to a calling convention. This convention specifies how the arguments are passed to the called method, how the return value is passed back to the caller, and who is responsible for removing the arguments from the stack (the caller or the callee). In this example I use the cdecl calling convention, but actual JIT compilers use other conventions.

General approach

The exact details depend on the implementation, but the general approach used by the .NET and Mono JIT compilers for compiling CIL to machine code is as follows:

1. 'Simulate' a stack and use it to turn all stack-based operations into operations on virtual registers (variables). In theory there is an unlimited number of virtual registers.
2. Turn all IL instructions into equivalent machine instructions.
3. Assign each virtual register to a real machine register. There is only a limited number of machine registers available. For example, the 32-bit x86 architecture has only 8 general-purpose registers.

Of course, there is a lot of optimization going on between these steps.

Example

Let's take an example to explain these steps:

ldarg.1                     // Load argument 1 on the stack
ldarg.3                     // Load argument 3 on the stack
add                         // Pop value2 and value1, and push (value1 + value2)
call int32 MyMethod(int32)  // Pop value and call MyMethod, push result
ret                         // Pop value and return
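
To make the stack semantics of this listing concrete, here is a minimal Python sketch of an interpreter for just these four instructions. It is a toy model, not how the CLR executes IL. The tuple encoding of instructions and the body of `my_method` are assumptions made for the illustration, since the original does not say what MyMethod does.

```python
def my_method(value: int) -> int:
    """Hypothetical stand-in for MyMethod; the original does not define it."""
    return value * 2


def run(il: list, args: list) -> int:
    """Interpret a tiny IL subset on an explicit evaluation stack."""
    stack = []
    for instr in il:
        op = instr[0]
        if op == "ldarg":
            stack.append(args[instr[1]])      # push a copy of the argument
        elif op == "add":
            value2 = stack.pop()              # the top of the stack is value2
            value1 = stack.pop()
            stack.append(value1 + value2)
        elif op == "call":
            target = instr[1]
            stack.append(target(stack.pop())) # pop the argument, push the result
        elif op == "ret":
            return stack.pop()
        else:
            raise ValueError(f"unknown instruction: {op}")
    raise RuntimeError("method ended without ret")


EXAMPLE_IL = [
    ("ldarg", 1),
    ("ldarg", 3),
    ("add",),
    ("call", my_method),
    ("ret",),
]
```

With the stand-in above, `run(EXAMPLE_IL, [0, 5, 0, 7])` adds arguments 1 and 3 to get 12 and returns 24.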

In step 1 the IL is turned into register-based operations (operation dest <- src1, src2):

ldarg.1 %reg0 <-            // Load argument 1 in %reg0
ldarg.3 %reg1 <-            // Load argument 3 in %reg1
add %reg0 <- %reg0, %reg1   // %reg0 = (%reg0 + %reg1)
// Call MyMethod(%reg0), store result in %reg0
call int32 MyMethod(int32) %reg0 <- %reg0
ret <- %reg0                // Return %reg0
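
The 'simulated stack' trick can be sketched in a few lines of Python: walk the IL once, but keep virtual register names on the stack instead of runtime values. This is only an illustration of the idea under my own assumptions (the tuple encoding, the `to_register_form` name, and reusing the first operand as the destination, as the listing above does). A real JIT works on a richer intermediate representation.

```python
import itertools


def to_register_form(il: list) -> list:
    """Turn stack-based instructions into register-based ones by simulating the stack."""
    fresh = (f"%reg{n}" for n in itertools.count())   # unlimited virtual registers
    stack = []      # holds virtual register names, not runtime values
    out = []
    for instr in il:
        op = instr[0]
        if op == "ldarg":
            dest = next(fresh)
            out.append(f"ldarg.{instr[1]} {dest} <-")
            stack.append(dest)
        elif op == "add":
            src2 = stack.pop()
            src1 = stack.pop()
            out.append(f"add {src1} <- {src1}, {src2}")
            stack.append(src1)                        # result reuses the first operand
        elif op == "call":
            arg = stack.pop()
            out.append(f"call {instr[1]} {arg} <- {arg}")
            stack.append(arg)                         # return value reuses the register
        elif op == "ret":
            out.append(f"ret <- {stack.pop()}")
        else:
            raise ValueError(f"unknown instruction: {op}")
    return out


EXAMPLE = [("ldarg", 1), ("ldarg", 3), ("add",), ("call", "MyMethod"), ("ret",)]
```

Running `to_register_form(EXAMPLE)` yields the same five register-based operations as the listing, with %reg0 and %reg1 as the only virtual registers.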

Then it is turned into machine instructions, e.g. x86:

mov %reg0, [addr_of_arg1]   // Move argument 1 in %reg0
mov %reg1, [addr_of_arg3]   // Move argument 3 in %reg1
add %reg0, %reg1            // Add %reg1 to %reg0

push %reg0                  // Push %reg0 on the real stack
call [addr_of_MyMethod]     // Call the method
add esp, 4                  // cdecl: the caller removes the argument from the stack

mov %reg0, eax              // Move the return value into %reg0
mov eax, %reg0              // Move %reg0 into the return value register EAX
ret                         // Return

Then each virtual register %reg0, %reg1 is assigned a machine register. For example:

mov eax, [addr_of_arg1]     // Move argument 1 in EAX
mov ecx, [addr_of_arg3]     // Move argument 3 in ECX
add eax, ecx                // Add ECX to EAX

push eax                    // Push EAX on the real stack
call [addr_of_MyMethod]     // Call the method
add esp, 4                  // cdecl: the caller removes the argument from the stack

mov ecx, eax                // Move the return value into ECX
mov eax, ecx                // Move ECX into the return value register EAX
ret                         // Return

Spilling

By choosing the registers carefully, some mov instructions can be eliminated. In the example above, keeping the return value in EAX instead of moving it to ECX and back would make both mov instructions after the call unnecessary. When at any point in the code more virtual registers are in use than there are machine registers available, a machine register must be spilled so that it can be reused. When a machine register is spilled, instructions are inserted that push the register's value on the real stack. Later, when the spilled value is needed again, instructions are inserted that pop the value from the real stack back into a register.
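
Here is a deliberately simplified Python sketch of that decision, loosely modeled on linear-scan register allocation. The original does not say which allocation algorithm a given JIT uses, so treat the algorithm choice, the `allocate` name and the interval format as my assumptions. It also always spills the newest value, where a real allocator would pick the spill candidate more carefully.

```python
def allocate(intervals: dict, machine_registers: list) -> dict:
    """Assign each virtual register a machine register, or "spill" if none is free.

    `intervals` maps a virtual register name to (first_use, last_use) positions.
    """
    free = list(machine_registers)
    active = []         # (last_use, vreg) pairs that currently occupy a register
    assignment = {}
    # Visit virtual registers in order of first use.
    for vreg, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Release registers whose value is dead before this interval starts.
        for item in [a for a in active if a[0] < start]:
            active.remove(item)
            free.append(assignment[item[1]])
        if free:
            assignment[vreg] = free.pop(0)
            active.append((end, vreg))
        else:
            assignment[vreg] = "spill"   # value lives on the real stack instead
    return assignment
```

With two machine registers, `allocate({"%reg0": (0, 4), "%reg1": (1, 2)}, ["eax", "ecx"])` fits both values in registers, matching the example. Adding a third value that is live at the same time forces a spill.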

Conclusion

As you can see, the machine code doesn't use the real stack nearly as often as the IL code used the evaluation stack. The reason is that machine registers are the fastest memory elements of a processor, so the compiler tries to use them as much as possible. A value is only stored on the real stack when there is a shortage of machine registers, or when the value is required to be on the stack (e.g. due to a calling convention).

Question 3

ECMA-335 is probably a good starting point for this.

For example, section I.12.4.1 has this:

Instructions emitted by the CIL code generator contain sufficient information for different implementations of the CLI to use different native calling conventions. All method calls initialize the method state areas (see §I.12.3.2) as follows:

The incoming arguments array is set by the caller to the desired values.

The local variables array always has null for object types and for fields within value types that hold objects. In addition, if the localsinit flag is set in the method header, then the local variables array is initialized to 0 for all integer types and to 0.0 for all floating-point types. Value types are not initialized by the CLI, but verified code will supply a call to an initializer as part of the method’s entry point code.

The evaluation stack is empty.

and I.12.3.2 has:

Part of each method state is an array that holds local variables and an array that holds arguments. Like the evaluation stack, each element of these arrays can hold any single data type or an instance of a value type. Both arrays start at 0 (that is, the first argument or local variable is numbered 0). The address of a local variable can be computed using the ldloca instruction, and the address of an argument using the ldarga instruction.

Associated with each method is metadata that specifies:

whether the local variables and memory pool memory will be initialized when the method is entered.

the type of each argument and the length of the argument array (but see below for variable argument lists).

the type of each local variable and the length of the local variable array.

The CLI inserts padding as appropriate for the target architecture. That is, on some 64-bit architectures all local variables can be 64-bit aligned, while on others they can be 8-, 16-, or 32-bit aligned. The CIL generator shall make no assumptions about the offsets of local variables within the array. In fact, the CLI is free to reorder the elements in the local variable array, and different implementations might choose to order them in different ways.

And then in Partition III, the description for callvirt (just as an example) has:

callvirt pops the object and the arguments off the evaluation stack before calling the method. If the method has a return value, it is pushed on the stack upon method completion. On the callee side, the obj parameter is accessed as argument 0, arg1 as argument 1, and so on.
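
The method state and call behaviour quoted above can be modeled in a few lines of Python. This is a sketch of the specification-level picture only. The names `MethodState` and `call` are invented for the illustration, and zero-filling the locals assumes the localsinit flag is set and that all locals are integers.

```python
from dataclasses import dataclass, field


@dataclass
class MethodState:
    args: list                                       # incoming arguments array, set by the caller
    local_vars: list                                 # local variables array
    eval_stack: list = field(default_factory=list)   # empty on method entry


def call(caller: MethodState, param_count: int, local_count: int) -> MethodState:
    """Pop `param_count` values off the caller's stack and build the callee's state."""
    if param_count > len(caller.eval_stack):
        raise ValueError("evaluation stack underflow")
    split = len(caller.eval_stack) - param_count
    args = caller.eval_stack[split:]    # the value pushed first becomes argument 0
    del caller.eval_stack[split:]       # the caller no longer sees the arguments
    # Assumption: localsinit is set and every local is an integer, hence the zeros.
    return MethodState(args=args, local_vars=[0] * local_count)
```

For a callvirt with the object pushed first and two further arguments, `call(caller, 3, 2)` gives a callee whose argument 0 is the object, whose evaluation stack is empty, and whose two locals are zero.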

Now this is all at a specification level. The actual implementation may well decide to just make the function call inherit the top n elements of the current method's stack, which means the arguments are already in the right place.