Lee's translation is already pretty good (well, he confused the odd and even cases(1)), but he fell into a couple of performance traps.
g n a p s =
if n > 0
then
let c = n `mod` 2
s' = (if c == 0 then (-) else (+)) s p
p' = p * a
in g (n-1) a p' s'
else s
He used
mod
instead ofrem
. The latter maps to machine division, the former performs additional checks to ensure a non-negative result. Thusmod
is a bit slower thanrem
, and if either satisfies the needs - because they yield identical results in the case where both arguments are non-negative; or because the result is only compared to 0 (both conditions are satisfied here) -rem
is preferable. Even better, and a bit more idiomatic is to useeven
(which usesrem
for the reasons mentioned above). The difference is not huge, though.No type signature. That means that the code is (type-class) polymorphic, and thus no strictness analysis is possible, nor any specialisations. If the code is used in the same module at a specific type, GHC can (and usually will, if optimisations are enabled) create a specialised version for that specific type that allows strictness analysis and some other optimisations (inlining of class methods like
(+)
etc.), in that case, one does not pay the polymorhism penalty. But if the use site is in a different module, that cannot happen. If (type-class) polymorphic code is desired, one should mark itINLINABLE
orINLINE
(for GHC < 7), so that its unfolding is exposed in the.hi
file and the function can be specialised and optimised at the use site.Since
g
is recursive, it cannot be inlined [meaning, GHC cannot inline it; in principle it is possible] at use sites, which often would enable more optimisations than a mere specialisation.One technique that often allows better optimisation for recursive functions is the worker/wrapper transformation. One creates a wrapper that calls a recursive (local) worker, then the non-recursive wrapper can be inlined, and when the worker is called with known arguments, that can enable further optimisations like constant folding or, in the case of function arguments, inlining. In particular the latter often has an enormous impact, when combined with a static-argument-transformation (arguments that never change in the recursive calls are not passed as arguments to the recursive worker).
In this case, we only have one static argument of type
Float
, so a worker/wrapper transformation with a SAT typically makes no difference (as a rule of thumb, a SAT pays off when- the static argument is a function
- several non-function arguments are static
so by this rule, we shouldn't expect any benefit from w/w + SAT, and in general, there is none). Here we have one special case where w/w + SAT can make a big difference, and that is when the factor
a
is 1. GHC has{-# RULES #-}
that eliminate multiplication by 1 for various types, and with such a short loop body, a multiplication more or less per iteration makes a difference, the running time is reduced by about 40% after points 3 and 4 have been applied. (There are noRULES
for multiplication by 0 or by-1
for floating point types because0*x = 0
resp.(-1)*x = -x
don't hold for NaNs.) For all othera
, the w/w + SATed{-# INLINABLE g #-} g n a p s = worker n p s where worker n p s | n <= 0 = s | otherwise = let s' = if even n then s + p else s - p in worker (n-1) a (p*a) s'
does not perform measurably different from the top-level recursive version with the same optimisations done.
Strictness. GHC's strictness analyser is good, but not perfect. It cannot see far enough through the algorithm to determine that the function is
- strict in
p
ifn >= 1
(assuming addition -(+)
- is strict in both arguments) - also strict in
a
ifn >= 2
(assuming strictness of(*)
in both arguments)
and then produce a worker that is strict in both. Instead you get a worker that uses an unboxed
Int#
forn
and an unboxedFloat#
fors
(I'm using the typeInt -> Float -> Float -> Float -> Float
here, corresponding to the C), and boxedFloat
s fora
andp
. Thus in each iteration you get two unboxings and a re-boxing. That costs (relatively) a lot of time, since besides that it's just a bit of simple arithmetic and tests. Help GHC along a bit, and make the worker (org
itself, if you don't do the worker/wrapper transform) strict inp
(bang pattern for example). That is enough to allow GHC producing a worker using unboxed values throughout.- strict in
Using division to test parity (not applicable if the type is
Int
and the LLVM backend is used).GHC's optimiser hasn't got down to the low-level bits very much yet, so the native code generator emits a division instruction for
x `rem` 2 == 0
and, when the rest of the loop body is as cheap as it is here, that costs a lot of time. LLVM's optimiser has already been taught to replace that with a bitmasking at type
Int
, so withghc -O2 -fllvm
you don't need to do that manually. With the native code generator, substituting that withx .&. 1 == 0
(needs
import Data.Bits
of course) produces a significant speedup (on normal platforms where a bitwise and is much faster than a division).
The final result
{-# INLINABLE g #-}
g n a p s = worker n p s
where
worker k !ap acc
| k > 0 = worker (k-1) (ap*a) (if k .&. (1 :: Int) == 0 then acc + ap else acc - ap)
| otherwise = acc
performs not measurably different (for the tested values) from the result of gcc -O3 -msse2 loop.c
, except for a = -1
, where gcc replaces the multiplication with a negation (assuming all NaNs equivalent).
(1) He's not alone in that,
c = n % 2;
if (!c) s += p;
else s -= p;
seems to be really tricky, as far as I can see everybody(2) got that wrong.
(2) With one exception ;)