Haskell performance: Struggling with utilizing profiling results and basic tuning techniques (eliminating explicit recursion, etc.)

Question 1

Using your program with increased size gives a stack overflow:

Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

This is probably caused by too much laziness. Looking at the heap profile broken down by type, you get the following:

Heao profile by type

(Note: I modified your program as leftaroundabout pointed out)

This doesn't look good. You shouldn't require linear space for your algorithm. You seem to be holding your Double values longer than required. Makeing the strict solves the issue:

{-# LANGUAGE BangPatterns #-}

iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' !h !t0 !y0 = y0:(iterateRK y' h t1 y1)
  where t1 = t0 + h
        y1 = rk4 y' h t0 y0

With this modification, the new heap profile looks like this:

New heap profile

This looks much better, the memory usage is much lower. -sstderr` also confirms that we only spend 2.5% of the total time in the garbage collector after the modification:

%GC     time       2.5%  (2.9% elapsed)

Now, the haskell version is still about 40% slower than the C one (using user time):

$ time ./tesths; time ./testc     
2.47e-321
./tesths  0,73s user 0,01s system 86% cpu 0,853 total
2.470328e-321
./testc  0,51s user 0,01s system 95% cpu 0,549 total

Increasing the number of iterations and using a heap-allocated array for the result storage in C lowers the difference once more:

time ./tesths; time ./testc
2.47e-321
./tesths  18,25s user 0,04s system 96% cpu 19,025 total
2.470328e-321
./testc  16,98s user 0,14s system 98% cpu 17,458 total

This is only a difference of about 9%.

But we can still do better. Using the stream-fusion package, we can eliminate the list completely while still keeping the decoupling. Here is the full code with that optimization included:

{-# LANGUAGE BangPatterns #-}
import qualified Data.List.Stream as S

rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' !h !t !y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
  where k1 = y' t y
        k2 = y' (t + h/2) (y + ((h/2) * k1))
        k3 = y' (t + h/2) (y + ((h/2) * k2))
        k4 = y' (t + h) (y + (h * k3))

iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' h = curry $ S.unfoldr $ \(!t0, !y0) -> Just (y0, (t0 + h, rk4 y' h t0 y0))

main :: IO ()
main = do
  let y' t y = -y
  let h = 1e-3
  let y0 = 1.0
  let t0 = 0
  let results = iterateRK y' h t0 y0
  print $ S.head $ (S.drop (pred 10000000) results)

I comiled with:

$ ghc -O2 ./test.hs -o tesths -fllvm

Here are the timings:

$ time ./tesths; time ./testc                    
2.47e-321
./tesths  15,85s user 0,02s system 97% cpu 16,200 total
2.470328e-321
./testc  16,97s user 0,18s system 97% cpu 17,538 total

Now we're even a bit faster than C, because we do no allocations. To do a similar transformation to the C program, we have to merge the two loops into one and loose the nice abstraction. Even then, it's only as fast as haskell:

$ time ./tesths; time ./testc
2.47e-321
./tesths  15,86s user 0,01s system 98% cpu 16,141 total
2.470328e-321
./testc  15,88s user 0,02s system 98% cpu 16,175 total

Question 2

I think that in order to make a fair comparison, you should exclude program initialization as well as printing the output (or measure it separately). By default, Haskell uses Strings which are lists of Chars and this makes output quite slow. Also Haskell has a complex runtime whose initialization can bias the results a lot for such a short task. You can use criterion library for that:

import Criterion.Main

-- ...

benchmarkIRK n =
    let y' t y = -y
        h      = 1e-3
        y0     = 1.0
        t0     = 0
    in take n (iterateRK y' h t0 y0)

benchmarkIRKPrint = writeFile "/dev/null" . show . benchmarkIRK

main = defaultMain
        [ bench "rk"      $ nf benchmarkIRK 1000
        , bench "rkPrint" $ nfIO (benchmarkIRKPrint 1000)
        ]

My measurements show that the actual computation takes something around 27 us, computing and printing takes around 350 us and running the whole program (without criterion) takes around 30 ms. So the actual computation takes just 1/1000 of the whole time and printing it just 1/100.

You should also measure your C program similarly, excluding any startup time and distinguishing what portion of time is consumed by computing and printing.

Question 3

The timings of your programs have very little to do with the languages' performance, and everything with terminal IO. Remove the printing of each step (BTW, putStrLn . show ≡≡ print) from your Haskell program, and you'll get

$ time RK-hs
1.0

real 0m0.004s
user 0m0.000s
sys 0m0.000s

... which isn't really significant, though – 1000 steps is far to little. With

main :: IO ()
main = do
    let y' t y = -y
        h = 1e-7
        y0 = 1.0
        t0 = 0
        results = iterateRK y' h t0 y0
    print . head $ drop 10000000 results

you get

$ time RK-hs +RTS -K100M
0.36787944117145965

real 0m0.653s
user 0m0.572s
sys 0m0.076s

while the equivalent in C has

$ time RK-c
Segmentation fault (core dumped)

oh great... ...but as you see, I had to increase the stack size for the Haskell program as well. Omitting the storage of the results in a stack-allocated array, we have

$ time RK-c
0.3678794412

real 0m0.152s
user 0m0.148s
sys 0m0.000s

so this is indeed faster, significantly now, than the Haskell version.

When even C has memory problems storing a whole lot of intermediate results (if you put it on the stack), this is worse in Haskell: each list node has to be heap-allocated seperately, and while allocation is much faster in Haskell's garbage-collected heap than in C's heap, it's still slow.

Haskell performance: Struggling with utilizing profiling results and basic tuning techniques (eliminating explicit recursion, etc.)

RK.hs

RK.c