Poor C# optimizer performance?

https://stackoverflow.com/questions/17105773

31-05-2022
|

Question

I've just written a small example checking, how C#'s optimizer behaves in case of indexers. The example is simple - I just wrap an array in a class and try to fill its values: once directly and once by indexer (which internally accesses the data exactly the same way as the direct solution does).

    public class ArrayWrapper
    {
        public ArrayWrapper(int newWidth, int newHeight)
        {
            width = newWidth;
            height = newHeight;

            data = new int[width * height];
        }

        public int this[int x, int y]
        {
            get
            {
                return data[y * width + x];
            }
            set
            {
                data[y * width + x] = value;
            }
        }

        public readonly int width, height;
        public readonly int[] data;
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            ArrayWrapper bigArray = new ArrayWrapper(15000, 15000);

            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            for (int y = 0; y < bigArray.height; y++)
                for (int x = 0; x < bigArray.width; x++)
                    bigArray.data[y * bigArray.width + x] = 12;
            stopwatch.Stop();

            Console.WriteLine(String.Format("Directly: {0} ms", stopwatch.ElapsedMilliseconds));

            stopwatch.Restart();
            for (int y = 0; y < bigArray.height; y++)
                for (int x = 0; x < bigArray.width; x++)
                    bigArray[x, y] = 12;
            stopwatch.Stop();

            Console.WriteLine(String.Format("Via indexer: {0} ms", stopwatch.ElapsedMilliseconds));

            Console.ReadKey();
        }
    }

Many SO posts taught me, that a programmer should highly trust optimizer to do its job. But in this case results are quite surprising:

Directly: 1282 ms
Via indexer: 2134 ms

(Compiled in Release configuration with the optimizations on, I double-checked).

That's a huge difference - no way being a statistical error (and it's both scalable and repeatable).

It's a very unpleasant surprise: in this case I'd expect the compiler to inline the indexer (it even does not include any range-checking), but it didn't do it. Here's the disassembly (note, that my comments are guesses on what is going on):

Direct

                    bigArray.data[y * bigArray.width + x] = 12;
000000a2  mov         eax,dword ptr [ebp-3Ch]  // Evaluate index of array
000000a5  mov         eax,dword ptr [eax+4] 
000000a8  mov         edx,dword ptr [ebp-3Ch] 
000000ab  mov         edx,dword ptr [edx+8] 
000000ae  imul        edx,dword ptr [ebp-10h]  
000000b2  add         edx,dword ptr [ebp-14h]  // ...until here
000000b5  cmp         edx,dword ptr [eax+4]    // Range checking
000000b8  jb          000000BF 
000000ba  call        6ED23CF5                 // Throw IndexOutOfRange
000000bf  mov         dword ptr [eax+edx*4+8],0Ch // Assign value to array

By indexer

                    bigArray[x, y] = 12;
0000015e  push        dword ptr [ebp-18h]       // Push x and y
00000161  push        0Ch                       // (prepare parameters)
00000163  mov         ecx,dword ptr [ebp-3Ch] 
00000166  mov         edx,dword ptr [ebp-1Ch] 
00000169  cmp         dword ptr [ecx],ecx 
0000016b  call        dword ptr ds:[004B27DCh]  // Call the indexer

(...)

                data[y * width + x] = value;
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  sub         esp,8 
00000006  mov         dword ptr [ebp-8],ecx 
00000009  mov         dword ptr [ebp-4],edx 
0000000c  cmp         dword ptr ds:[004B171Ch],0 // Some additional checking, I guess?
00000013  je          0000001A 
00000015  call        6ED24648                   
0000001a  mov         eax,dword ptr [ebp-8]        // Evaluating index
0000001d  mov         eax,dword ptr [eax+4] 
00000020  mov         edx,dword ptr [ebp-8] 
00000023  mov         edx,dword ptr [edx+8] 
00000026  imul        edx,dword ptr [ebp+0Ch] 
0000002a  add         edx,dword ptr [ebp-4]        // ...until here
0000002d  cmp         edx,dword ptr [eax+4]        // Range checking
00000030  jb          00000037 
00000032  call        6ED23A5D                     // Throw IndexOutOfRange exception
00000037  mov         ecx,dword ptr [ebp+8] 
0000003a  mov         dword ptr [eax+edx*4+8],ecx  // Actual assignment
                }
0000003e  nop 
0000003f  mov         esp,ebp 
00000041  pop         ebp 
00000042  ret         8                            // Returning

That's a total disaster (in terms of code optimization). So my questions are:

Why this code (quite simple, actually) was not optimized properly?
How can I modify this code, such that it is optimized as I wanted it to be (if possible)?
Can a programmer rely on C#'s optimizer as much as on C++'s one?

Ok, I know, that the last one is hard to answer. But lately I read many questions about C++ performance and was amazed how much can optimizer do (for example, total inlining of std::tie, two std::tuple ctors and overloaded opeartor < on the fly).

Edit: (in response to comments)

It seems, that actually that was still my fault, because I checked the performance while running the IDE. Now I ran the same program out of IDE and attached to it by debugger on-the-fly. Now I get:

Direct

                    bigArray.data[y * bigArray.width + x] = 12;
000000ae  mov         eax,dword ptr [ebp-10h] 
000000b1  imul        eax,edx 
000000b4  add         eax,ebx 
000000b6  cmp         eax,edi 
000000b8  jae         000001FA 
000000be  mov         dword ptr [ecx+eax*4+8],0Ch

Indexer

                    bigArray[x, y] = 12;
0000016b  mov         eax,dword ptr [ebp-14h] 
0000016e  imul        eax,edx 
00000171  add         eax,ebx 
00000173  cmp         eax,edi 
00000175  jae         000001FA 
0000017b  mov         dword ptr [ecx+eax*4+8],0Ch

These codes are exactly the same (in terms of CPU instructions). After running, the indexer version achieved even better results than direct one, but only (I guess) because of cache'ing. After putting the tests inside a loop, everything went back to normal:

Directly: 573 ms
Via indexer: 353 ms
Directly: 356 ms
Via indexer: 362 ms
Directly: 351 ms
Via indexer: 370 ms
Directly: 351 ms
Via indexer: 354 ms
Directly: 359 ms
Via indexer: 356 ms

Well; lesson learned. Even though compiling in Release mode, there is a huge difference, whether program is run in IDE or standalone. Thanks @harold for the idea.

Solution

Running code with the debugger immediately attached makes it generate slow code (unless you enable "Suppress JIT optimization on module load", but that makes debugging a little hard). The procedure I use to view the optimized assembly is to throw an exception (conditionally, say, if a static variable is 0, so the optimizer doesn't get too trigger-happy), and attach the debugger when it crashes. You'll probably have to go through the "Manually choose debuggers"-route. Also, make sure you enable "Show external code" (from the context menu on the call stack).

The code I got for the direct access was

innerloop:
  mov  eax,dword ptr [esi+8]   ; bigArray.width
  imul eax,ebx                 ; * y
  add  eax,edi                 ; + x
  mov  edx,dword ptr [ebp-14h] ; pointer to bigArray.data
  mov  ecx,dword ptr [ebp-10h] ; \
  cmp  eax,ecx                 ; |  bounds check
  jae  00000087                ; /
  mov  dword ptr [edx+eax*4+8],0Ch ; data[index] = 12
  inc  edi                     ; x++
  cmp  edi,dword ptr [esi+8]   ; \
  jl   innerloop               ; / if (x < bigArray.width) goto innerloop

And for the indexer:

innerloop:
  mov  eax,dword ptr [esi+8]   ; bigArray.width
  imul eax,ebx                 ; * y
  add  eax,edi                 ; + x
  mov  edx,dword ptr [ebp-14h] ; pointer to bigArray.data
  mov  ecx,dword ptr [ebp-10h] ; \
  cmp  eax,ecx                 ; |  bounds check
  jae  00000087                ; /
  mov  dword ptr [edx+eax*4+8],0Ch ; data[index] = 12
  inc  edi                     ; x++
  cmp  edi,dword ptr [esi+8]   ; \
  jl   innerloop               ; / if (x < bigArray.width) goto innerloop

This is not a paste-mistake, the code for the inner loop really was exactly the same.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow