Using your program with increased size gives a stack overflow:
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
This is probably caused by too much laziness. Looking at the heap profile broken down by type, you get the following:
(Note: I modified your program as leftaroundabout pointed out)
This doesn't look good. You shouldn't require linear space for your algorithm. You seem to be holding your Double values longer than required. Makeing the strict solves the issue:
{-# LANGUAGE BangPatterns #-}
iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' !h !t0 !y0 = y0:(iterateRK y' h t1 y1)
where t1 = t0 + h
y1 = rk4 y' h t0 y0
With this modification, the new heap profile looks like this:
This looks much better, the memory usage is much lower. -sstderr` also confirms that we only spend 2.5% of the total time in the garbage collector after the modification:
%GC time 2.5% (2.9% elapsed)
Now, the haskell version is still about 40% slower than the C one (using user time):
$ time ./tesths; time ./testc
2.47e-321
./tesths 0,73s user 0,01s system 86% cpu 0,853 total
2.470328e-321
./testc 0,51s user 0,01s system 95% cpu 0,549 total
Increasing the number of iterations and using a heap-allocated array for the result storage in C lowers the difference once more:
time ./tesths; time ./testc
2.47e-321
./tesths 18,25s user 0,04s system 96% cpu 19,025 total
2.470328e-321
./testc 16,98s user 0,14s system 98% cpu 17,458 total
This is only a difference of about 9%.
But we can still do better. Using the stream-fusion package, we can eliminate the list completely while still keeping the decoupling. Here is the full code with that optimization included:
{-# LANGUAGE BangPatterns #-}
import qualified Data.List.Stream as S
rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' !h !t !y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
where k1 = y' t y
k2 = y' (t + h/2) (y + ((h/2) * k1))
k3 = y' (t + h/2) (y + ((h/2) * k2))
k4 = y' (t + h) (y + (h * k3))
iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' h = curry $ S.unfoldr $ \(!t0, !y0) -> Just (y0, (t0 + h, rk4 y' h t0 y0))
main :: IO ()
main = do
let y' t y = -y
let h = 1e-3
let y0 = 1.0
let t0 = 0
let results = iterateRK y' h t0 y0
print $ S.head $ (S.drop (pred 10000000) results)
I comiled with:
$ ghc -O2 ./test.hs -o tesths -fllvm
Here are the timings:
$ time ./tesths; time ./testc
2.47e-321
./tesths 15,85s user 0,02s system 97% cpu 16,200 total
2.470328e-321
./testc 16,97s user 0,18s system 97% cpu 17,538 total
Now we're even a bit faster than C, because we do no allocations. To do a similar transformation to the C program, we have to merge the two loops into one and loose the nice abstraction. Even then, it's only as fast as haskell:
$ time ./tesths; time ./testc
2.47e-321
./tesths 15,86s user 0,01s system 98% cpu 16,141 total
2.470328e-321
./testc 15,88s user 0,02s system 98% cpu 16,175 total