Array boundaries check optimization in a for-loop

Question 1

I'm also using Win8 x64, .NET 4.5, Release build, outside of the debugger (that last point is important). I get:

0: 813ms vs 421ms
1: 439ms vs 420ms
2: 440ms vs 420ms
3: 431ms vs 429ms
4: 433ms vs 427ms
5: 424ms vs 437ms
6: 427ms vs 434ms
7: 430ms vs 432ms
8: 432ms vs 435ms
9: 430ms vs 430ms
10: 427ms vs 418ms
11: 422ms vs 421ms
12: 434ms vs 420ms
13: 439ms vs 425ms
14: 426ms vs 429ms
15: 426ms vs 426ms
16: 417ms vs 432ms
17: 442ms vs 425ms
18: 420ms vs 429ms
19: 420ms vs 422ms

The first iteration pays a JIT / "fusion" cost, but after that the two versions are about the same (each column is occasionally the faster one, but there is no consistent difference worth talking about).

using System;
using System.Diagnostics;
static class Program
{
    static void Main()
    {
        var ar = new int[500000000];

        for (int j = 0; j < 20; j++)
        {
            var sw = Stopwatch.StartNew();
            var length = ar.Length;
            for (var i = 0; i < length; i++)
            {
                if (ar[i] == 0) ;
            }

            sw.Stop();
            long hoisted = sw.ElapsedMilliseconds;

            sw = Stopwatch.StartNew();
            for (var i = 0; i < ar.Length; i++)
            {
                if (ar[i] == 0) ;
            }
            sw.Stop();
            long direct = sw.ElapsedMilliseconds;

            Console.WriteLine("{0}: {1}ms vs {2}ms", j, hoisted, direct);
        }
    }
}

Question 2

I investigated this some more, and found it really difficult to make a benchmark that actually shows the effect of the bounds check elimination optimization.

First some problems with the old benchmark:

The disassembly showed that the JIT compiler was able to optimize the first version as well. That was a surprise to me, but the disassembly doesn't lie. This, of course, completely defeats the purpose of this benchmark. Fix: take the length as a function argument.
The array is too big, which means cache misses, which add a lot of noise to our signal. Fix: use a short array but loop over it multiple times.

But now the real problem: the JIT compiler is doing something excessively clever. There is no array bounds test in the inner loop, even when the loop bound comes from a function argument. The generated code is different, but the inner loop is essentially the same. Not identical (different registers and such), but it follows the same pattern:

_loop: mov eax, [somewhere + index]
       add index, 4
       cmp index, end
       jl _loop

There is no significant difference in execution time because there is no significant difference in the part of the generated code that matters most.
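
For reference, a benchmark with both fixes applied (length passed as an argument, short array looped over many times) might look like the sketch below. This is not the code I actually measured; the class name, method names, array size and repeat count are all made up for illustration. I also sum the elements instead of using the empty `if`, and mark the methods NoInlining so the JIT cannot see that the argument is really ar.Length. Both are my assumptions about what keeps the comparison honest.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

static class BoundsCheckBench
{
    // The loop bound is an argument, so inside this method the JIT has
    // no direct proof that it equals ar.Length.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static long SumWithGivenLength(int[] ar, int length)
    {
        long sum = 0;
        for (var i = 0; i < length; i++)
        {
            sum += ar[i];
        }
        return sum;
    }

    // The loop bound is ar.Length itself.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static long SumWithArrayLength(int[] ar)
    {
        long sum = 0;
        for (var i = 0; i < ar.Length; i++)
        {
            sum += ar[i];
        }
        return sum;
    }

    static void Main()
    {
        var ar = new int[1024];      // hypothetical size, small enough to stay in cache
        const int Repeats = 200000;  // many passes over the short array
        long sink = 0;               // accumulates results so the calls are not dead code

        for (int j = 0; j < 5; j++)
        {
            var sw = Stopwatch.StartNew();
            for (int r = 0; r < Repeats; r++) sink += SumWithGivenLength(ar, ar.Length);
            sw.Stop();
            long given = sw.ElapsedMilliseconds;

            sw = Stopwatch.StartNew();
            for (int r = 0; r < Repeats; r++) sink += SumWithArrayLength(ar);
            sw.Stop();
            long direct = sw.ElapsedMilliseconds;

            Console.WriteLine("{0}: {1}ms vs {2}ms", j, given, direct);
        }

        Console.WriteLine(sink);
    }
}
```

As with the original benchmark, run it as a Release build outside the debugger, and check the disassembly rather than trusting the timings alone.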

Question 3

I think the answer is that the garbage collector is running and changing your timings.

Disclaimer: I can't see the entire context of the OP code because you didn't post a compilable example; I'm assuming you are reallocating the array rather than reusing it. If not, then this is not the correct answer!

Consider this code:

using System;
using System.Diagnostics;

namespace Demo
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            var ar = new int[500000000];
            test1(ar);
            //ar = new int[500000000]; // Uncomment this line.
            test2(ar);
        }

        private static void test1(int[] ar)
        {
            var sw = new Stopwatch();
            sw.Start();

            var length = ar.Length;
            for (var i = 0; i < length; i++)
            {
                if (ar[i] == 0);
            }

            sw.Stop();
            Console.WriteLine("test1 took " + sw.Elapsed);
        }

        private static void test2(int[] ar)
        {
            var sw = new Stopwatch();
            sw.Start();

            for (var i = 0; i < ar.Length; i++)
            {
                if (ar[i] == 0);
            }

            sw.Stop();
            Console.WriteLine("test2 took " + sw.Elapsed);
        }
    }
}

On my system it prints:

test1 took 00:00:00.6643788
test2 took 00:00:00.3516378

If I uncomment the line marked // Uncomment this line. then the timings change to:

test1 took 00:00:00.6615819
test2 took 00:00:00.6806489

This is because of the GC collecting the previous array.

[EDIT] To avoid JIT startup costs, I put the entire test into a loop:

for (int i = 0; i < 8; ++i)
{
    test1(ar);
    //ar = new int[500000000]; // Uncomment this line.
    test2(ar);
}

And then my results with the second array allocation commented out are:

test1 took 00:00:00.6437912
test2 took 00:00:00.3534027
test1 took 00:00:00.3401437
test2 took 00:00:00.3486296
test1 took 00:00:00.3470775
test2 took 00:00:00.3675475
test1 took 00:00:00.3501221
test2 took 00:00:00.3549338
test1 took 00:00:00.3427057
test2 took 00:00:00.3574063
test1 took 00:00:00.3566458
test2 took 00:00:00.3462722
test1 took 00:00:00.3430952
test2 took 00:00:00.3464017
test1 took 00:00:00.3449196
test2 took 00:00:00.3438316

And with the second array allocation enabled:

test1 took 00:00:00.6572665
test2 took 00:00:00.6565778
test1 took 00:00:00.3576911
test2 took 00:00:00.6910897
test1 took 00:00:00.3464013
test2 took 00:00:00.6638542
test1 took 00:00:00.3548638
test2 took 00:00:00.6897472
test1 took 00:00:00.4464020
test2 took 00:00:00.7739877
test1 took 00:00:00.3835624
test2 took 00:00:00.8432918
test1 took 00:00:00.3496910
test2 took 00:00:00.6471341
test1 took 00:00:00.3486505
test2 took 00:00:00.6527160

Note that test2 consistently takes longer due to the GC.

Unfortunately, the GC makes the timing results pretty meaningless.

For example, if I change the test code to this:

for (int i = 0; i < 8; ++i)
{
    var ar = new int[500000000];
    GC.Collect();
    test1(ar);
    //ar = new int[500000000]; // Uncomment this line.
    test2(ar);
}

With the line commented out I get:

test1 took 00:00:00.6354278
test2 took 00:00:00.3464486
test1 took 00:00:00.6672933
test2 took 00:00:00.3413958
test1 took 00:00:00.6724916
test2 took 00:00:00.3530412
test1 took 00:00:00.6606178
test2 took 00:00:00.3413083
test1 took 00:00:00.6439316
test2 took 00:00:00.3404499
test1 took 00:00:00.6559153
test2 took 00:00:00.3413563
test1 took 00:00:00.6955377
test2 took 00:00:00.3364670
test1 took 00:00:00.6580798
test2 took 00:00:00.3378203

And with it uncommented:

test1 took 00:00:00.6340203
test2 took 00:00:00.6276153
test1 took 00:00:00.6813719
test2 took 00:00:00.6264782
test1 took 00:00:00.6927222
test2 took 00:00:00.6269447
test1 took 00:00:00.7010559
test2 took 00:00:00.6262000
test1 took 00:00:00.6975080
test2 took 00:00:00.6457846
test1 took 00:00:00.6796235
test2 took 00:00:00.6341214
test1 took 00:00:00.6823508
test2 took 00:00:00.6455403
test1 took 00:00:00.6856985
test2 took 00:00:00.6430923

I think the moral of this test is: The GC for this particular test is such a large overhead compared to the rest of the code that it is completely skewing the timing results, and they can't be trusted to mean anything.
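
If you want to check whether a collection actually lands inside a timed region, one option is to allocate everything up front, force a collection before starting the stopwatch, and compare GC.CollectionCount before and after. The sketch below is only that, a sketch: the class and method names and the smaller array size are my own, and the loops count zero elements so the work has a visible result. It does not prove what causes the slowdown, it only reports whether a gen-2 collection happened while the clock was running.

```csharp
using System;
using System.Diagnostics;

static class GcAwareTiming
{
    // Same shape as test1: length hoisted into a local.
    public static int Test1(int[] ar)
    {
        int zeros = 0;
        var length = ar.Length;
        for (var i = 0; i < length; i++)
        {
            if (ar[i] == 0) zeros++;
        }
        return zeros;
    }

    // Same shape as test2: ar.Length used directly as the bound.
    public static int Test2(int[] ar)
    {
        int zeros = 0;
        for (var i = 0; i < ar.Length; i++)
        {
            if (ar[i] == 0) zeros++;
        }
        return zeros;
    }

    // Times one call and reports how many gen-2 collections occurred during it.
    public static TimeSpan Measure(string name, Func<int[], int> test, int[] ar)
    {
        // Get pending collection work out of the way before the clock starts.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        int gen2Before = GC.CollectionCount(2);
        var sw = Stopwatch.StartNew();
        int zeros = test(ar);
        sw.Stop();
        int gen2During = GC.CollectionCount(2) - gen2Before;

        Console.WriteLine("{0} took {1} (zeros: {2}, gen-2 GCs during run: {3})",
            name, sw.Elapsed, zeros, gen2During);
        return sw.Elapsed;
    }

    static void Main()
    {
        const int Size = 50000000; // hypothetical; smaller than the 500000000 used above

        // Both arrays are allocated before any timing starts.
        var first = new int[Size];
        var second = new int[Size];

        for (int i = 0; i < 8; ++i)
        {
            Measure("test1", Test1, first);
            Measure("test2", Test2, second);
        }
    }
}
```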

Question 4

You are calling a property (ar.Length) in the second one, so it will be slower.