Background.
The question is how to optimize the Delphi code so that its performance is comparable to Java's, without resorting to hand-written asm.
Analysis.
In the given example, the algorithm uses floating point calculations. The performance and weak points of the compiler have been investigated in other answers. In theory the 64-bit (x64) compiler could perform better, since the SSE2 opcodes and registers offer more room for optimization. So the compiler would be the bottleneck here.
It was also suggested that a better algorithm could improve the performance.
Let's look at this a little bit more.
Improving algorithm.
In the inner loop, the loop index i is used three times in floating point expressions. Each use forces an integer-to-float conversion when the value is loaded into an FPU or SSE2 register, and this has a big impact on performance. Let's investigate whether we can help the compiler optimize those conversions away.
procedure xxx (n: integer; m: integer);
var
  t, ii: double;
  i, j: integer;
  d, r: double;
begin
  t:= 0.0;
  for j:= 1 to n do
  begin
    t:= t / 1000.0;
    ii:= 1.0;
    for i:= 1 to m do
    begin
      t:= t + ii / 999999.0;
      d:= t * t + ii;
      ii:= ii + 1.0;
      r:= (t + d) / (200000.0 * ii);
      t:= t - r;
    end;
  end;
  writeln(t);
end;
Now we have a clean algorithm that uses only float values in its calculations. For reference, here is the Java code:
public static void xxy(int n, int m)
{
    double t;
    int i, j;
    double d, r, ii;
    t = 0.0;
    for (j = 1; j <= n; j++)
    {
        t = t / 1000.0;
        ii = 1.0;
        for (i = 1; i <= m; i++)
        {
            t = t + ii / 999999.0;
            d = t * t + ii;
            ii = ii + 1.0;
            r = (t + d) / (200000.0 * ii);
            t = t - r;
        }
    }
    System.out.println(t);
}
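As a sanity check, the following sketch compares the float-counter loop with an integer-index version. The integer-index loop body is an assumption reconstructed from the description above (the original code is not shown in this answer), and the class and method names are hypothetical. Since every int value converts exactly to a double, both variants should give bit-identical results.

```java
// Hypothetical check harness; only the body of withFloatCounter
// is taken from the code shown above.
final class LoopCounterCheck {

    // ASSUMED shape of the original loop: the int index i appears
    // three times in floating point expressions and is converted each time.
    static double withIntIndex(int n, int m) {
        double t = 0.0;
        for (int j = 1; j <= n; j++) {
            t = t / 1000.0;
            for (int i = 1; i <= m; i++) {
                t = t + i / 999999.0;
                double d = t * t + i;
                double r = (t + d) / (200000.0 * (i + 1));
                t = t - r;
            }
        }
        return t;
    }

    // Updated loop: the double counter ii shadows i, so no
    // int-to-double conversion is needed inside the loop body.
    static double withFloatCounter(int n, int m) {
        double t = 0.0;
        for (int j = 1; j <= n; j++) {
            t = t / 1000.0;
            double ii = 1.0;
            for (int i = 1; i <= m; i++) {
                t = t + ii / 999999.0;
                double d = t * t + ii;
                ii = ii + 1.0;
                double r = (t + d) / (200000.0 * ii);
                t = t - r;
            }
        }
        return t;
    }

    public static void main(String[] args) {
        // Small, arbitrary sizes chosen only for a quick check.
        double a = withIntIndex(5, 100000);
        double b = withFloatCounter(5, 100000);
        System.out.println(a + " " + b + " identical=" + (a == b));
    }
}
```

The sizes passed in main are arbitrary and much smaller than anything useful for timing; the point is only to confirm that the rewrite does not change the result.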
Benchmark.
Using XE2 compiler.
                    x32      x64      Java(x64)
-----------------------------------------------
Original algorithm  23417ms  22293ms  22045ms
Updated algorithm   22362ms  14059ms  15507ms
The disassembly for the x64 code looks like this:
Project19.dpr.11: begin
000000000046ABC0 55 push rbp
000000000046ABC1 4883EC20 sub rsp,$20
000000000046ABC5 488BEC mov rbp,rsp
Project19.dpr.12: t:= 0.0;
000000000046ABC8 F20F1005B0000000 movsd xmm0,qword ptr [rel $000000b0]
000000000046ABD0 C7C001000000 mov eax,$00000001
000000000046ABD6 4189C8 mov r8d,ecx
000000000046ABD9 89C1 mov ecx,eax
000000000046ABDB 413BC8 cmp ecx,r8d
000000000046ABDE 7F7B jnle xxx + $9B
000000000046ABE0 4183C001 add r8d,$01
Project19.dpr.15: t:= t / 1000.0;
000000000046ABE4 F20F5E059C000000 divsd xmm0,qword ptr [rel $0000009c]
Project19.dpr.16: ii := 1.0;
000000000046ABEC F20F100D9C000000 movsd xmm1,qword ptr [rel $0000009c]
000000000046ABF4 C7C001000000 mov eax,$00000001
000000000046ABFA 4189D1 mov r9d,edx
000000000046ABFD 413BC1 cmp eax,r9d
000000000046AC00 7F50 jnle xxx + $92
000000000046AC02 4183C101 add r9d,$01
Project19.dpr.19: t:= t + ii / 999999.0;
000000000046AC06 660F28D1 movapd xmm2,xmm1
000000000046AC0A F20F5E1586000000 divsd xmm2,qword ptr [rel $00000086]
000000000046AC12 F20F58C2 addsd xmm0,xmm2
Project19.dpr.20: d:= t * t + ii;
000000000046AC16 660F28D0 movapd xmm2,xmm0
000000000046AC1A F20F59D0 mulsd xmm2,xmm0
000000000046AC1E F20F58D1 addsd xmm2,xmm1
Project19.dpr.21: ii := ii + 1.0;
000000000046AC22 F20F580D66000000 addsd xmm1,qword ptr [rel $00000066]
Project19.dpr.22: r:= (t + d) / (200000.0 * ii);
000000000046AC2A 660F28D8 movapd xmm3,xmm0
000000000046AC2E F20F58DA addsd xmm3,xmm2
000000000046AC32 660F28D1 movapd xmm2,xmm1
000000000046AC36 F20F591562000000 mulsd xmm2,qword ptr [rel $00000062]
000000000046AC3E F20F5EDA divsd xmm3,xmm2
000000000046AC42 660F29DA movapd xmm2,xmm3
Project19.dpr.23: t:= t - r;
000000000046AC46 F20F5CC2 subsd xmm0,xmm2
Project19.dpr.24: end;
000000000046AC4A 83C001 add eax,$01
000000000046AC4D 413BC1 cmp eax,r9d
000000000046AC50 75B4 jnz xxx + $46
000000000046AC52 90 nop
Project19.dpr.25: end;
000000000046AC53 83C101 add ecx,$01
000000000046AC56 413BC8 cmp ecx,r8d
000000000046AC59 7589 jnz xxx + $24
000000000046AC5B 90 nop
Project19.dpr.26: WriteLn(t);
000000000046AC5C 488B0DC5100100 mov rcx,[rel $000110c5]
000000000046AC63 660F29C1 movapd xmm1,xmm0
000000000046AC67 E874D7F9FF call @Write0Ext
000000000046AC6C 4889C1 mov rcx,rax
000000000046AC6F E88CD7F9FF call @WriteLn
000000000046AC74 E877AFF9FF call @_IOTest
Project19.dpr.27: end;
000000000046AC79 488D6520 lea rsp,[rbp+$20]
The extra integer-to-float conversions are gone and the registers are put to better use.
Extra optimization.
For the 32-bit (x32) compiler, replacing the constants 999999.0 and 200000.0 with reciprocal constants (const cA : Double = 1.0/999999.0; cB : Double = 1.0/200000.0;) and multiplying instead of dividing will also improve performance.
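The reciprocal rewrite can be sketched as follows. This is written in Java only to match the reference code above; the suggestion itself targets the 32-bit Delphi compiler, and no speedup on the JVM is claimed. The class name, method names, and the exact placement of cB in the expression for r are assumptions. Note that multiplying by a rounded reciprocal is not bit-identical to dividing, so the result can differ in the last digits.

```java
// Hypothetical sketch of the reciprocal-constant variant.
final class ReciprocalSketch {
    // Reciprocals are computed once; cA and cB in the text above.
    static final double C_A = 1.0 / 999999.0;
    static final double C_B = 1.0 / 200000.0;

    // Float-counter loop as shown earlier, using divisions by constants.
    static double withDivision(int n, int m) {
        double t = 0.0;
        for (int j = 1; j <= n; j++) {
            t = t / 1000.0;
            double ii = 1.0;
            for (int i = 1; i <= m; i++) {
                t = t + ii / 999999.0;
                double d = t * t + ii;
                ii = ii + 1.0;
                double r = (t + d) / (200000.0 * ii);
                t = t - r;
            }
        }
        return t;
    }

    // Same loop with the two constants replaced by their reciprocals.
    // The division by ii remains, since ii changes every iteration.
    static double withReciprocals(int n, int m) {
        double t = 0.0;
        for (int j = 1; j <= n; j++) {
            t = t / 1000.0;
            double ii = 1.0;
            for (int i = 1; i <= m; i++) {
                t = t + ii * C_A;
                double d = t * t + ii;
                ii = ii + 1.0;
                double r = (t + d) * C_B / ii;
                t = t - r;
            }
        }
        return t;
    }

    public static void main(String[] args) {
        // Arbitrary small sizes, for comparing the two results only.
        System.out.println(withDivision(3, 10000));
        System.out.println(withReciprocals(3, 10000));
    }
}
```

Whether the small difference in rounding is acceptable depends on the application; for a benchmark like this one it should not matter.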