Why is it that, in C#, referencing a variable from a function argument is faster than from a private property?

StackOverflow https://stackoverflow.com/questions/20073280

  •  31-07-2022

Question

It may be the case that my hardware is the culprit, but during testing, I've found that:

void SomeFunction(AType ofThing) {
    DoSomething(ofThing);
}

...is faster than:

private AType _ofThing;
void SomeFunction() {
    DoSomething(_ofThing);
}

I believe it has to do with how the compiler translates this to CIL. Could anyone please explain, specifically, why this happens?

Here's some code where it happens:

public void TestMethod1()
{

    var stopwatch = new Stopwatch();

    var r = new int[] { 1, 2, 3, 4, 5 };
    var i = 0;
    stopwatch.Start();
    while (i < 1000000)
    {
        DoSomething(r);
        i++;
    }

    stopwatch.Stop();

    Console.WriteLine(stopwatch.ElapsedMilliseconds);

    i = 0;
    stopwatch.Restart();
    while (i < 1000000)
    {
        DoSomething();
        i++;
    }

    stopwatch.Stop();

    Console.WriteLine(stopwatch.ElapsedMilliseconds);
}

private void DoSomething(int[] arg1)
{
    var r = arg1[0] * arg1[1] * arg1[2] * arg1[3] * arg1[4];
}

private int[] _arg1 = new [] { 1, 2, 3, 4, 5 };
private void DoSomething()
{
    var r = _arg1[0] * _arg1[1] * _arg1[2] * _arg1[3] * _arg1[4];
}

In my case it is 2.5x slower to use a private property.


Solution

I believe it has to do with how the compiler translates this to CIL.

Not really. Performance doesn't directly depend on the CIL code, because that's not what's actually executed. What's executed is the JITed native code, so you should look at that when you're interested in performance.
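If you want to see the JITed code yourself, one low-tech approach (just a sketch, not the only way) is to run a Release build with a debugger attached, make sure the methods have been JIT-compiled, and then open the debugger's disassembly view:

// Call both methods once so the JIT compiles them, then break into the
// debugger and inspect the generated native code (for example via
// Debug -> Windows -> Disassembly in Visual Studio, with
// "Suppress JIT optimization on module load" turned off in the options).
DoSomething(new[] { 1, 2, 3, 4, 5 });
DoSomething();
System.Diagnostics.Debugger.Break();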

So, let's look at the code generated for the DoSomething(int[]) loop:

mov         eax,dword ptr [ebx+4] ; get the length of the array
cmp         eax,0       ; if it's 0
jbe         0000018C    ; jump to code that throws IndexOutOfRangeException
cmp         eax,1       ; if it's 1, etc.
jbe         0000018C 
cmp         eax,2 
jbe         0000018C 
cmp         eax,3 
jbe         0000018C 
cmp         eax,4 
jbe         0000018C 
inc         esi         ; i++
cmp         esi,0F4240h ; if i < 1000000
jl          000000B7    ; loop again

What's interesting about this code is that no useful work is done at all; most of it is array bounds checking (why the checking hasn't been hoisted out of the loop and performed only once, I have no idea).

Also notice that the code is inlined, so you're not paying the cost of a function call.

This code takes around 1.7 ms on my computer.

So, what does the loop for DoSomething() look like?

mov         ecx,dword ptr [ebp-10h]  ; access this
call        dword ptr ds:[001637F4h] ; call DoSomething()
inc         esi                      ; i++
cmp         esi,0F4240h              ; if i < 1000000
jl          00000120                 ; loop again

Okay, so this actually calls the method, no inlining this time. What does the method itself look like?

mov         eax,dword ptr [ecx+4] ; access this._arg1
cmp         dword ptr [eax+4],0   ; if its length is 0
jbe         00000022 ; jump to code that throws IndexOutOfRangeException
cmp         dword ptr [eax+4],1   ; etc.
jbe         00000022 
cmp         dword ptr [eax+4],2 
jbe         00000022 
cmp         dword ptr [eax+4],3 
jbe         00000022 
cmp         dword ptr [eax+4],4 
jbe         00000022 
ret                               ; bounds checks successful, return

Comparing with the previous version (and ignoring the overhead of the function call for now), this does three different memory accesses instead of just one, which could explain some of the performance difference. (I think the five accesses to eax+4 should be counted only as one, because otherwise the compiler would optimize them.)

This code runs in about 3.0 ms for me.
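As an aside, if you wanted to bring the field-based version closer to the argument-based one, a common trick (a sketch, not something from the question; DoSomethingLocal is a made-up name) is to copy the field into a local once, so there is only a single this._arg1 load:

private void DoSomethingLocal()
{
    var a = _arg1; // read the field once; the remaining accesses go through the local
    var r = a[0] * a[1] * a[2] * a[3] * a[4];
}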

How much overhead does the method call take? We can check that by adding [MethodImpl(MethodImplOptions.NoInlining)] to the previously inlined DoSomething(int[]). The assembly now looks like this:

mov         ecx,dword ptr [ebp-10h]  ; access this
mov         edx,dword ptr [ebp-14h]  ; access r
call        dword ptr ds:[002937E8h] ; call DoSomething(int[])
inc         esi                      ; i++
cmp         esi,0F4240h              ; if i < 1000000
jl          000000A0                 ; loop again

Notice that r is no longer kept in a register; it's on the stack instead, which adds another slowdown.

Now DoSomething(int[]):

push        ebp                   ; save ebp from caller to stack
mov         ebp,esp               ; write our own ebp
mov         eax,dword ptr [edx+4] ; read the length of the array
cmp         eax,0    ; if it's 0
jbe         00000021 ; jump to code that throws IndexOutOfRangeException
cmp         eax,1    ; etc.
jbe         00000021 
cmp         eax,2 
jbe         00000021 
cmp         eax,3 
jbe         00000021 
cmp         eax,4 
jbe         00000021 
pop         ebp      ; restore ebp
ret                  ; return

This code runs in about 3.2 ms for me. That's even slower than DoSomething(). What's going on?

It turns out that [MethodImpl(MethodImplOptions.NoInlining)] seems to cause those unnecessary ebp instructions: if I add that attribute to DoSomething() as well, it runs in 3.3 ms.

This means the difference between stack access and heap access is pretty small (but still measurable). The fact that the array pointer could be kept in a register when the method was inlined was probably more significant.
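For reference, here is roughly what the two methods look like with the attribute applied; MethodImpl and MethodImplOptions come from System.Runtime.CompilerServices:

// using System.Runtime.CompilerServices; at the top of the file

[MethodImpl(MethodImplOptions.NoInlining)]
private void DoSomething(int[] arg1)
{
    var r = arg1[0] * arg1[1] * arg1[2] * arg1[3] * arg1[4];
}

[MethodImpl(MethodImplOptions.NoInlining)]
private void DoSomething()
{
    var r = _arg1[0] * _arg1[1] * _arg1[2] * _arg1[3] * _arg1[4];
}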


So, the conclusion is that the big difference you're seeing is because of inlining. The JIT compiler decided to inline the code for DoSomething(int[]), but not for DoSomething(), which allowed the code for DoSomething(int[]) to be very efficient. The most likely reason for that is that the IL for DoSomething() is much longer (46 bytes vs. 21 bytes); historically the JIT's inlining size limit has been around 32 bytes of IL, which falls between the two.

Also, you're not really measuring what you wrote (array accesses and multiplications), because the result is never used and the whole computation can be optimized away. So be careful when devising microbenchmarks: make sure the compiler can't simply ignore the code you actually wanted to measure.
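One simple way to do that (a sketch; the _sink field and the DoSomethingMeasured name are made up for illustration) is to make the result observable, for example by accumulating it into a field or returning it, so the JIT cannot treat the multiplications as dead code:

private long _sink; // hypothetical field, used only to consume the result

private void DoSomethingMeasured(int[] arg1)
{
    // Storing the result somewhere visible prevents the JIT from
    // eliminating the multiplications as dead code.
    _sink += arg1[0] * arg1[1] * arg1[2] * arg1[3] * arg1[4];
}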

OTHER TIPS

Several people have made a stack/heap distinction, but this is a false dichotomy; when the IL is compiled to machine code there are additional possibilities, such as passing arguments in registers, which is potentially even faster than getting them off of the stack. See Eric Lippert's great blog post The Truth About Value Types for more thoughts along these lines. In any case, a proper analysis of the performance difference will almost certainly require looking at the generated machine code, not at the IL, and will potentially depend on the version of the JIT compiler, etc.

If that is your example, I would not be surprised to see that SomeFunction is being inlined.

It is also entirely possible that the JIT isn't able to inline the second example.

You would need to look at the compiled code to prove this; I don't know of a deterministic way to tell whether something has been inlined short of inspecting the generated machine code.

You could at least rule out caching by having another thread write to _ofThing: if you get similar results while the value being read is changing, then caching isn't the explanation.
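A rough sketch of that experiment, applied to the _arg1 field from the question (Task and CancellationTokenSource are just one way to set it up):

// Keep overwriting the field from another thread while the timing loop runs;
// if the results stay similar even though the value being read keeps changing,
// caching of the read value can be ruled out.
var cts = new System.Threading.CancellationTokenSource();
var writer = System.Threading.Tasks.Task.Run(() =>
{
    var rnd = new Random();
    while (!cts.Token.IsCancellationRequested)
        _arg1 = new[] { rnd.Next(1, 10), 2, 3, 4, 5 };
});

// ... run the same timing loop over DoSomething() as before ...

cts.Cancel();
writer.Wait();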

Even if the function is not inlined, referencing an argument can be faster because of cache locality: the argument is already in the CPU cache.

It's worth noting that you put it in the cache by calling the function in the first place, so you have already paid that price.

This is related to where your variable is stored: on the stack or on the heap. For example, the following code is much faster because it uses a static variable:

private static AType _ofThing;
void SomeFunction() {
    DoSomething(_ofThing);
}

For more information about where variables are stored, please have a look at this excellent answer from Hans Passant

When you call a method using its parameters, you are using stack memory; when you use a global variable (a field), you are using heap memory (see the sketch after the lists below).

Stack

  • very fast access
  • don't have to explicitly de-allocate variables
  • space is managed efficiently by CPU, memory will not become fragmented
  • local variables only
  • limit on stack size (OS-dependent)
  • variables cannot be resized

Heap

  • variables can be accessed globally
  • no limit on memory size
  • (relatively) slower access
  • no guaranteed efficient use of space, memory may become fragmented over time as blocks of memory are allocated, then freed
  • variables can be resized

http://tutorials.csharp-online.net/Stack_vs._Heap
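As a rough illustration of the distinction those lists are drawing (a sketch; the names are made up):

void Example()
{
    int local = 42;                  // value-type local: conceptually lives on the stack
    int[] data = new[] { 1, 2, 3 };  // the array object itself is allocated on the heap
    // 'data' (the reference) is a stack local, but the elements it points to
    // live on the heap, just like an instance field such as _arg1 does.
}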

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow