Question

Basically you have two ways for doing this:

for (int x = 0; x < UPPER_X; x++)
    for (int y = 0; y < UPPER_Y; y++)
    {        
        arr1[x, y] = get_value();
        arr2[y, x] = get_value();
    }

The only difference is what variable to change in the inner loop: first or second. I heard that the results differ from language to language.

What is right order for .NET?

Was it helpful?

Solution

The reason one is faster than the other has to do with the processor's cache, and the way the data is laid out in memory.

There are two normal ways to store the two-dimensional data is in one-dimensional address space, Either you can store all of the data for the first row, then the second row, and so on (aka row major order), or you can do it by columns (aka column major order). Below is what the memory locations would be for a 3x3 array for both of these options.

Rows:

1 2 3
4 5 6
7 8 9

Columns:

1 4 7
2 5 8
3 6 9

When you access one memory location, an entire cache line (which can be between 8 and 512 bytes, according to Wikipedia) is loaded into the cache. So if you access the next memory location it will already be in the cache. Thus, it can be much faster to access memory sequentially than to jump around in the address space. Thus, with large two-dimensional arrays, there can be a significant speed difference between choosing rows or columns as your inner loop variable.

OTHER TIPS

You should benchmark your particular circumstances to be sure.

You would think that there would be no difference for rectangular arrays (i.e. contiguously allocated memory), BUT according to this MSDN article there is a difference:

You can get even better results by converting a multidimensional array into a single-dimensional array. If you don't mind the syntax, this can be trivial; just use one index as an offset. For example, the following declares a single-dimensional array to be used as a two-dimensional array:

double[] myArray = new double[ROW_DIM * COLUMN_DIM];

For indexing the elements of this array, use the following offset:

myArray[row * COLUMN_DIM + column];

This will undoubtedly be faster than an equivalent jagged or rectangular array.

So I did the benchmark and the results are that access to arr1 is three times faster.

Very interesting, I never stopped to consider that can be such a huge difference simply by accessing the array indexes "non-sequentially". I gave it a try myself, and also found the following code had the second loop taking between 2 - 3 times longer :

// Hmmm, how to insert blank lines in the code formatter???
static void Main(string[] args)
{
    Stopwatch timer = new Stopwatch();
    int arraySize = 10000;

    // First array, access X by Y
    int[,] xbyy = new int[arraySize, arraySize];


    timer.Start();
    for (int x = 0; x < arraySize; x++)
        for (int y = 0; y < arraySize; y++)
        {
            xbyy[x, y] = 15;
        }
    timer.Stop();
    TimeSpan duration = timer.Elapsed;
    string realTimeFormat = string.Format("{0:00} minutes {1:00} seconds {2:000} milliseconds",
                    duration.Minutes, duration.Seconds, duration.Milliseconds);
    Console.WriteLine("X by Y took " + realTimeFormat);



    // Seecond array, access Y by X
    int[,] ybyx = new int[arraySize, arraySize];

    timer.Start();
    for (int x = 0; x < arraySize; x++)
        for (int y = 0; y < arraySize; y++)
        {
            ybyx[y, x] = 15;
        }
    timer.Stop();
    duration = timer.Elapsed;
    realTimeFormat = string.Format("{0:00} minutes {1:00} seconds {2:000} milliseconds",
                duration.Minutes, duration.Seconds, duration.Milliseconds);
    Console.WriteLine("Y by X took " + realTimeFormat);


    Console.ReadLine();
}

To keep things short, here are the emitted IL snippets for the X by Y loop and the Y by X loop.

Initial code prior to looping,

.method private hidebysig static void  Main(string[] args) cil managed
{
  .entrypoint
  // Code size       290 (0x122)
  .maxstack  4
  .locals init ([0] class [System]System.Diagnostics.Stopwatch timer,
           [1] int32 arraySize,
           [2] int32[0...,0...] xbyy,
           [3] int32 x,
           [4] int32 y,
           [5] valuetype [mscorlib]System.TimeSpan duration,
           [6] string realTimeFormat,
           [7] int32[0...,0...] ybyx,
           [8] int32 V_8,
           [9] int32 V_9)
  IL_0000:  newobj     instance void [System]System.Diagnostics.Stopwatch::.ctor()
  IL_0005:  stloc.0
  IL_0006:  ldc.i4     0x2710
  IL_000b:  stloc.1

looping X by Y

  IL_000c:  ldloc.1
  IL_000d:  ldloc.1
  IL_000e:  newobj     instance void int32[0...,0...]::.ctor(int32,
                                                             int32)
  IL_0013:  stloc.2
  IL_0014:  ldloc.0
  IL_0015:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
  IL_001a:  ldc.i4.0
  IL_001b:  stloc.3
  IL_001c:  br.s       IL_003d
  IL_001e:  ldc.i4.0
  IL_001f:  stloc.s    y
  IL_0021:  br.s       IL_0034
  IL_0023:  ldloc.2
  IL_0024:  ldloc.3
  IL_0025:  ldloc.s    y
  IL_0027:  ldc.i4.s   15
  IL_0029:  call       instance void int32[0...,0...]::Set(int32,
                                                           int32,
                                                           int32)
  IL_002e:  ldloc.s    y
  IL_0030:  ldc.i4.1
  IL_0031:  add
  IL_0032:  stloc.s    y
  IL_0034:  ldloc.s    y
  IL_0036:  ldloc.1
  IL_0037:  blt.s      IL_0023
  IL_0039:  ldloc.3
  IL_003a:  ldc.i4.1
  IL_003b:  add
  IL_003c:  stloc.3
  IL_003d:  ldloc.3
  IL_003e:  ldloc.1
  IL_003f:  blt.s      IL_001e
  IL_0041:  ldloc.0

looping Y by X

  IL_0090:  ldloc.1
  IL_0091:  ldloc.1
  IL_0092:  newobj     instance void int32[0...,0...]::.ctor(int32,
                                                             int32)
  IL_0097:  stloc.s    ybyx
  IL_0099:  ldloc.0
  IL_009a:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
  IL_009f:  ldc.i4.0
  IL_00a0:  stloc.s    V_8
  IL_00a2:  br.s       IL_00c7
  IL_00a4:  ldc.i4.0
  IL_00a5:  stloc.s    V_9
  IL_00a7:  br.s       IL_00bc
  IL_00a9:  ldloc.s    ybyx
  IL_00ab:  ldloc.s    V_9
  IL_00ad:  ldloc.s    V_8
  IL_00af:  ldc.i4.s   15
  IL_00b1:  call       instance void int32[0...,0...]::Set(int32,
                                                           int32,
                                                           int32)
  IL_00b6:  ldloc.s    V_9
  IL_00b8:  ldc.i4.1
  IL_00b9:  add
  IL_00ba:  stloc.s    V_9
  IL_00bc:  ldloc.s    V_9
  IL_00be:  ldloc.1
  IL_00bf:  blt.s      IL_00a9
  IL_00c1:  ldloc.s    V_8
  IL_00c3:  ldc.i4.1
  IL_00c4:  add
  IL_00c5:  stloc.s    V_8
  IL_00c7:  ldloc.s    V_8
  IL_00c9:  ldloc.1
  IL_00ca:  blt.s      IL_00a4
  IL_00cc:  ldloc.0

The IL logic flow is somewhat similar. The main difference I can observe is the first loop manages to use stloc and ldloc for positions 2 and 3 for the first array and the X index variable. By the time it came to the second loop, it had expended the maxstack and thus used the stloc.s and ldloc.s instructions. I believe this is the difference between referencing variables on the stack versus on the heap, and contributing to the slower performance.

Now, if you swap the order in which the loops are tested, so that the Y by X loop gets run first to access stack references, you will see the timing durations get reversed.


UPDATE: I was wrong about referencing stack or heap addresses. It just seems that the first four variables in a method are more efficient to reference with the dedicated stloc.0, 1, 3, 4 and ldloc.0, 1, 3, 4 opcodes.

http://weblogs.asp.net/mnolton/archive/2004/01/09/48992.aspx

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top