I am not exactly familiar with Actors, but assuming that the tightest loop is indeed loop
you could avoid passing function f
as argument.
For one, applications of passed functions cannot take advantage of the strictness of the actual passed function. Rather, code generation must assume conservatively that the passed function takes its arguments lazily and returns a lazy result.
Second, in our case you use f
really just once here, so one can inline it. (This is how it is done in the scala code in the article you linked.)
Look at the code generated for the tail recursion in the following sample code that mimics yours:
test b c = loop 100 0 f
where
loop 0 !acc f = acc
loop n !acc f = loop (n-1) (acc + f (acc-1) (acc+1)) f -- tail recursion
f x y = 2*x + 7*y
We get there:
// arg2$f is the accumulator
arg$2 = arg$2f + (int)frege.runtime.Delayed.<java.lang.Integer>forced(
f_3237.apply(PreludeBase.INum_Int._minusƒ.apply(arg$2f, 1)).apply(
PreludeBase.INum_Int._plusƒ.apply(arg$2f, 1)
).result()
);
You see here that f
is called lazily which causes all the argument expressios to also be computed lazily. Note the number of method calls this requires!
In your case the code should still be something like:
(double)Delayed.<Double>forced(f.apply(acc).apply(curr).result())
This means, two closures are build with the boxed values acc and curr and then the result is computed, i.e. the function f
gets called with the unboxed arguments, and the result gets again boxed, just to get unboxed again (forced) for the next loop.
Now compare the following, where we just do not pass f
but call it directly:
test b c = loop 100 0
where
loop 0 !acc = acc
loop n !acc = loop (n-1) (acc + f (acc-1) (acc+1))
f x y = 2*x + 7*y
We get:
arg$2 = arg$2f + f(arg$2f - 1, arg$2f + 1);
Much better! Finally, in the case above we can do without a function call at all:
loop n !acc = loop (n-1) (acc + f) where
f = 2*x + 7*y
x = acc-1
y = acc+1
And this gets:
final int y_3236 = arg$2f + 1;
final int x_3235 = arg$2f - 1;
...
arg$2 = arg$2f + ((2 * x_3235) + (7 * y_3236));
Please try this out and let us know what happens. The main boost in performance should come from not passing f
, whereas the inlining will probably be done in the JIT anyway.
The additional cost with fold
is probably because you also had to create some list before applying it.