How to use modern CPU instructions in Delphi? (Java faster than Delphi?)

Question 1

According to your code, what is slow with the 32 bit Delphi compiler is the floating point arithmetic support, which is far from optimized, and copy a lot of content on/to the FPU stack.

In respect to floating point arithmetic, not only Java JITted code will be faster. Even modern JavaScript JIT compilers can be much better than Delphi!

This blog article is just a reference about this, and provide asm-level explanation about Delphi slowness for floating point:

enter image description here

But if you use the Delphi compiler targeting the Win64 platform, it will emit not x87 but SSE2 opcodes, and will be much faster. I suspect comparable to Java JITted executable.

And, in respect to Java, any Delphi executable will use much less memory than the JVM, so here, Delphi executables are perfectly on the track!

If you want your code to be faster, do not use asm nor low-level optimization trick, but change your algorithm. It could be order of magnitude faster than compilation hints. Dedicated process will be achieved with inlined asm opcodes - take a look at this great set of articles for such low level hacks. But it is not easy to master, and usually, proper software profiling and adding some cache is the best way to performance!

Question 2

To follow on from Arnaud's point - I actually compiled this in delphi for x86 and x64.

32-bit compiler :

Unit1.pas.36: t:= t / 1000.0;
0051274D DD45F0           fld qword ptr [ebp-$10]
00512750 D835E4275100     fdiv dword ptr [$005127e4]
00512756 DD5DF0           fstp qword ptr [ebp-$10]
00512759 9B               wait 
Unit1.pas.37: for i:= 1 to m do
0051275A 8B45F8           mov eax,[ebp-$08]
0051275D 85C0             test eax,eax
0051275F 7E57             jle $005127b8
00512761 8945D0           mov [ebp-$30],eax
00512764 C745EC01000000   mov [ebp-$14],$00000001
Unit1.pas.39: t:= t + i / 999999.0;
0051276B DB45EC           fild dword ptr [ebp-$14]
0051276E D835E8275100     fdiv dword ptr [$005127e8]
00512774 DC45F0           fadd qword ptr [ebp-$10]
00512777 DD5DF0           fstp qword ptr [ebp-$10]
0051277A 9B               wait 
Unit1.pas.40: d:= t * t + i;
0051277B DD45F0           fld qword ptr [ebp-$10]
0051277E DC4DF0           fmul qword ptr [ebp-$10]
00512781 DB45EC           fild dword ptr [ebp-$14]
00512784 DEC1             faddp st(1)
00512786 DD5DE0           fstp qword ptr [ebp-$20]
00512789 9B               wait 
Unit1.pas.41: r:= (t + d) / (200000.0 * (i + 1));
0051278A DD45F0           fld qword ptr [ebp-$10]
0051278D DC45E0           fadd qword ptr [ebp-$20]
00512790 8B45EC           mov eax,[ebp-$14]
00512793 40               inc eax
00512794 8945CC           mov [ebp-$34],eax
00512797 DB45CC           fild dword ptr [ebp-$34]
0051279A D80DEC275100     fmul dword ptr [$005127ec]
005127A0 DEF9             fdivp st(1)
005127A2 DD5DD8           fstp qword ptr [ebp-$28]
005127A5 9B               wait 
Unit1.pas.42: t:= t - r;
005127A6 DD45F0           fld qword ptr [ebp-$10]
005127A9 DC65D8           fsub qword ptr [ebp-$28]
005127AC DD5DF0           fstp qword ptr [ebp-$10]
005127AF 9B               wait 
Unit1.pas.43: end;
005127B0 FF45EC           inc dword ptr [ebp-$14]
Unit1.pas.37: for i:= 1 to m do
005127B3 FF4DD0           dec dword ptr [ebp-$30]
005127B6 75B3             jnz $0051276b
Unit1.pas.44: end;
005127B8 FF45E8           inc dword ptr [ebp-$18]

64-bit compiler

Unit1.pas.36: t:= t / 1000.0;
000000000059F94E F20F104548       movsd xmm0,qword ptr [rbp+$48]
000000000059F953 F20F5E05BD000000 divsd xmm0,qword ptr [rel $000000bd]
000000000059F95B F20F114548       movsd qword ptr [rbp+$48],xmm0
000000000059F960 C7C001000000     mov eax,$00000001
000000000059F966 8B5568           mov edx,[rbp+$68]
000000000059F969 894544           mov [rbp+$44],eax
000000000059F96C 395544           cmp [rbp+$44],edx
000000000059F96F 7F73             jnle xxx + $C4
000000000059F971 83C201           add edx,$01
Unit1.pas.39: t:= t + i / 999999.0;
000000000059F974 F20F2A4544       cvtsi2sd xmm0,dword ptr [rbp+$44]
000000000059F979 F20F5E059F000000 divsd xmm0,qword ptr [rel $0000009f]
000000000059F981 F20F104D48       movsd xmm1,qword ptr [rbp+$48]
000000000059F986 F20F58C8         addsd xmm1,xmm0
000000000059F98A F20F114D48       movsd qword ptr [rbp+$48],xmm1
Unit1.pas.40: d:= t * t + i;
000000000059F98F F20F104548       movsd xmm0,qword ptr [rbp+$48]
000000000059F994 F20F594548       mulsd xmm0,qword ptr [rbp+$48]
000000000059F999 F20F2A4D44       cvtsi2sd xmm1,dword ptr [rbp+$44]
000000000059F99E F20F58C1         addsd xmm0,xmm1
000000000059F9A2 F20F114538       movsd qword ptr [rbp+$38],xmm0
Unit1.pas.41: r:= (t + d) / (200000.0 * (i + 1));
000000000059F9A7 F20F104548       movsd xmm0,qword ptr [rbp+$48]
000000000059F9AC F20F584538       addsd xmm0,qword ptr [rbp+$38]
000000000059F9B1 8B4544           mov eax,[rbp+$44]
000000000059F9B4 83C001           add eax,$01
000000000059F9B7 F20F2AC8         cvtsi2sd xmm1,eax
000000000059F9BB F20F590D65000000 mulsd xmm1,qword ptr [rel $00000065]
000000000059F9C3 F20F5EC1         divsd xmm0,xmm1
000000000059F9C7 F20F114530       movsd qword ptr [rbp+$30],xmm0
Unit1.pas.42: t:= t - r;
000000000059F9CC F20F104548       movsd xmm0,qword ptr [rbp+$48]
000000000059F9D1 F20F5C4530       subsd xmm0,qword ptr [rbp+$30]
000000000059F9D6 F20F114548       movsd qword ptr [rbp+$48],xmm0
Unit1.pas.43: end;
000000000059F9DB 83454401         add dword ptr [rbp+$44],$01
000000000059F9DF 395544           cmp [rbp+$44],edx
000000000059F9E2 7590             jnz xxx + $54
000000000059F9E4 90               nop
Unit1.pas.44: end;
000000000059F9E5 83454001         add dword ptr [rbp+$40],$01
000000000059F9E9 394D40           cmp [rbp+$40],ecx
000000000059F9EC 0F855CFFFFFF     jnz xxx + $2E
000000000059F9F2 90               nop
Unit1.pas.45: writeln(t);
000000000059F9F3 488B0D9E150300   mov rcx,[rel $0003159e]

Oddly, in this case, the x87 fpu code was actually about ~5% faster. The conclusion is probably nothing more than the fact that Delphi's 32-bit/x87 compiler is extremely mature and rather well optimized and the 64-bit compiler probably has some room to refine in a bit of performance. I can easily see a few places where the SSE code can be optimized here; i, for example, could be stored in an XMM register and re-used rather than being re-converted each time with cvtsi2sd, d could be held in an XMM register for the next calculation rather than being stored and re-loaded, etc.

Unaligned MOVs in and out of XMM registers can be actualy surprisingly expensive. The actual SSE calculations are faster, but the excessive data movement is probably evening out the score. Maybe Java forces 16-byte alignment on the stack? I know MacOS does this and there are definite benefits for SSE to use aligned rather than unaligned moves (at the expense of consuming more stack space, of course).

For example

fild : 1 op, 9 latency (x87)
cvtsi2sd : 2 op, 12 latency (SSE)

or

fld : 1 op, 4 latency (x87)
movsd [r,m] : 2op, 4 latency (SSE)

Delphi's compiler, while emitting SSE instructions, still seems to be treating the workflow in a similar way to the way it would do so with the x87 unit, which isn't necessarily the best way. In either case, David is correct - the compiler is what it is. You can't do anything to change it.

Where I need fast mathematical routines, I still code them in ASM myself - this will generally be superior to anything any compiler can do because you can customize the behaviour to the exact calculations you are doing. I have old legacy 32-bit applications with hand-tuned SSE3 ASM algorithms for complex number arithmetic and matrix operations. The key is that you don't need to optimize everything - you only need to optimize the bottlenecks. This is a rather important point to keep aware of.

Question 3

I'm going to answer the meta question here: "why can't the Delphi compiler use more modern CPU instructions, and why can Java?"

Basically, there are two ways to compile code:

pre-compiled on the developers machine
post-compiled (including JITed) on the target machine

Examples of 1. include Delphi, C/C++, etc.
Examples of 2. include Java, .NET, JavaScript, etc.

Pre-compiled environments

Pre-compiled environments have you compile your code once, and run it on your target computers. Compiled programs cannot run on machines using older instruction sets than the compiled program uses. The minimum requirement is the lowest of what the compiler can do best, and the lowest architecture of all your the target machines. If you don't know your target machines, it limited by the compiler.

Post-compiled environments

Post-compiled environments compile on the target machine. You don't have to know what architecture it runs: the compiler that runs on it needs to know what it supports to take best benefit of it. The minimum requirement is the lowest of what the compiler can do best, and the architecture of the target machine.

The reason is that in a post-compiled, JITed or interpreted language environment, the compiler actually runs on the target machine. Which means that the compiler can use all features of that target architecture. It could even take into account aspects like physical memory, cache sizes or disk speed, and measure current run-time performance to do post-compile optimizations on running code.

Delphi and other tools

As for the Windows 32-bit Delphi compiler, I think the minimum requirement is 486 still or the Pentium (given the Pentium-Safe FDIV option). It uses x87 for CPU code because of that.
The Windows 64-bit Delphi compiler has a minimum requirement of SSE instructions which it uses for FPU code.
I've not yet checked the other compiler platforms on minimum requirements.

The minimum Delphi requirements have to do with the emphasis on backward compatibility.

Some other environments (most C/++ compilers, and likely others too) allow you to specify the minimum instruction set. Delphi hasn't. and I think the main reason is complexity of development and testing. The matrix (if it is indeed a 2-dimensional problem) of possibilities gets large really quickly.

JIT compilers usually do not fully support the latest hardware architecture benefits across the board, as it is very expensive to do so.

JIT compilers often do support optimizations for certain processor families (for instance when copying regions of memory).

I know that Java and .NET made quite some advances in this area over the last decade. There is a really good 2005 article on .NET JIT usage of CPU features.

Question 4

Background.

The question is how to optimize Delphi code to make it comparable to Java. And doing so without using asm coding.

Analysis.

In the given example, the algorithm is using floating point calculations. The performance and weak points in the compiler has been investigated in other answers. Theoretically the x64 bit compiler could perform better, since the SSE2 opcodes and registers can offer better optimization. So the compiler would be the bottleneck here.

It was also suggested that a better algorithm could improve the performance. Let's look at this a little bit more.

Improving algorithm.

In the algorithm loop, the loop index i is used three times as a variable in the calculations. Since this forces an integer to float conversion each time (upon loading into a fpu or SSE2 register), it will have a big impact on performance. Let's investigate if we can help the compiler to optimize away those conversions.

procedure xxx (n: integer; m: integer);
var
  t,ii: double;
  i, j: integer;
  d, r: double;
begin
  t:= 0.0;
  for j:= 1 to n do
  begin
    t:= t / 1000.0;
    ii:= 1.0;
    for i:= 1 to m do
    begin
      t:= t + ii / 999999.0;
      d:= t * t + ii;
      ii:= ii + 1.0;
      r:= (t + d) / (200000.0 * ii);
      t:= t - r;
    end;
  end;
  writeln(t);
end;

Now we have a clean algorithm using only float values. For reference here is the java code:

public static void xxy(int n, int m)
{
    double t;
    int i, j;
    double d, r, ii;
    t = 0.0;
    for (j = 1; j <= n; j++)
    {
      t = t / 1000.0;
      ii = 1.0;
      for (i = 1; i <= m; i++)
      {
            t = t + ii / 999999.0;
            d = t * t + ii;
            ii = ii + 1.0;
            r = (t + d) / (200000.0 * ii);
            t = t - r;
      }
    }
    System.out.println(t);
}

Benchmark.

Using XE2 compiler.

                     x32      x64     Java(x64) 
                     --------------------------          
Original algorithm   23417ms  22293ms 22045ms
Updated algorithm    22362ms  14059ms 15507ms

The disassembly for the x64 code looks like this:

Project19.dpr.11: begin
000000000046ABC0 55               push rbp
000000000046ABC1 4883EC20         sub rsp,$20
000000000046ABC5 488BEC           mov rbp,rsp
Project19.dpr.12: t:= 0.0;
000000000046ABC8 F20F1005B0000000 movsd xmm0,qword ptr [rel $000000b0]
000000000046ABD0 C7C001000000     mov eax,$00000001
000000000046ABD6 4189C8           mov r8d,ecx
000000000046ABD9 89C1             mov ecx,eax
000000000046ABDB 413BC8           cmp ecx,r8d
000000000046ABDE 7F7B             jnle xxx + $9B
000000000046ABE0 4183C001         add r8d,$01
Project19.dpr.15: t:= t / 1000.0;
000000000046ABE4 F20F5E059C000000 divsd xmm0,qword ptr [rel $0000009c]
Project19.dpr.16: ii := 1.0;
000000000046ABEC F20F100D9C000000 movsd xmm1,qword ptr [rel $0000009c]
000000000046ABF4 C7C001000000     mov eax,$00000001
000000000046ABFA 4189D1           mov r9d,edx
000000000046ABFD 413BC1           cmp eax,r9d
000000000046AC00 7F50             jnle xxx + $92
000000000046AC02 4183C101         add r9d,$01
Project19.dpr.19: t:= t + ii / 999999.0;
000000000046AC06 660F28D1         movapd xmm2,xmm1
000000000046AC0A F20F5E1586000000 divsd xmm2,qword ptr [rel $00000086]
000000000046AC12 F20F58C2         addsd xmm0,xmm2
Project19.dpr.20: d:= t * t + ii;
000000000046AC16 660F28D0         movapd xmm2,xmm0
000000000046AC1A F20F59D0         mulsd xmm2,xmm0
000000000046AC1E F20F58D1         addsd xmm2,xmm1
Project19.dpr.21: ii := ii + 1.0;
000000000046AC22 F20F580D66000000 addsd xmm1,qword ptr [rel $00000066]
Project19.dpr.22: r:= (t + d) / (200000.0 * ii);
000000000046AC2A 660F28D8         movapd xmm3,xmm0
000000000046AC2E F20F58DA         addsd xmm3,xmm2
000000000046AC32 660F28D1         movapd xmm2,xmm1
000000000046AC36 F20F591562000000 mulsd xmm2,qword ptr [rel $00000062]
000000000046AC3E F20F5EDA         divsd xmm3,xmm2
000000000046AC42 660F29DA         movapd xmm2,xmm3
Project19.dpr.23: t:= t - r;
000000000046AC46 F20F5CC2         subsd xmm0,xmm2
Project19.dpr.24: end;
000000000046AC4A 83C001           add eax,$01
000000000046AC4D 413BC1           cmp eax,r9d
000000000046AC50 75B4             jnz xxx + $46
000000000046AC52 90               nop
Project19.dpr.25: end;
000000000046AC53 83C101           add ecx,$01
000000000046AC56 413BC8           cmp ecx,r8d
000000000046AC59 7589             jnz xxx + $24
000000000046AC5B 90               nop
Project19.dpr.26: WriteLn(t);
000000000046AC5C 488B0DC5100100   mov rcx,[rel $000110c5]
000000000046AC63 660F29C1         movapd xmm1,xmm0
000000000046AC67 E874D7F9FF       call @Write0Ext
000000000046AC6C 4889C1           mov rcx,rax
000000000046AC6F E88CD7F9FF       call @WriteLn
000000000046AC74 E877AFF9FF       call @_IOTest
Project19.dpr.27: end;
000000000046AC79 488D6520         lea rsp,[rbp+$20]

The extra integer to float conversions are gone and the registers are better used.

Extra optimization

For the x32 bit compiler, replacing 999999.0; and 200000.0 with reciprocal constants (const cA : Double = 1.0/999999.0; cB : Double = 1.0/200000.0;) and using multiplication instead will also improve performance.

Question 5

How can we use modern CPU instructions in Delphi (without resorting to ASM)?

If the compiler won't emit the CPU instructions that you wish to use, then there is no alternative to generating the desired instructions yourself, for example by using the inline assembler.