Question

I am quite new to Haskell and I am wanting to create a histogram. I am using Data.Vector.Unboxed to fuse operations on the data; which is blazing fast (when compiled with -O -fllvm) and the bottleneck is my fold application; which aggregates the bucket counts.

How can I make it faster? I read about trying to reduce the number of thunks by keeping things strict so I've made things strict by using seq and foldr' but not seeing much performance increase. Your ideas are strongly encouraged.

import qualified Data.Vector.Unboxed as V

histogram :: [(Int,Int)]
histogram = V.foldr' agg [] $ V.zip k v
 where 
    n = 10000000
    c = 1000000
    k = V.generate n (\i -> i `div` c * c)
    v = V.generate n (\i -> 1)
    agg kv [] = [kv]
    agg kv@(k,v) acc@((ck,cv):as)
        | k == ck = let a = (ck,cv+v):as in a `seq` a
        | otherwise = let a = kv:acc in a `seq` a

main :: IO ()
main = print histogram 

Compiled with:

ghc --make -O -fllvm histogram.hs
Was it helpful?

Solution

First, compile the program with -O2 -rtsopts. Then, to get a first idea where you could optimize, run the program with the options +RTS -sstderr:

$ ./question +RTS -sstderr
[(0,1000000),(1000000,1000000),(2000000,1000000),(3000000,1000000),(4000000,1000000),(5000000,1000000),(6000000,1000000),(7000000,1000000),(8000000,1000000),(9000000,1000000)]
   1,193,907,224 bytes allocated in the heap
   1,078,027,784 bytes copied during GC
     282,023,968 bytes maximum residency (7 sample(s))
      86,755,184 bytes maximum slop
             763 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1964 colls,     0 par    3.99s    4.05s     0.0021s    0.0116s
  Gen  1         7 colls,     0 par    1.60s    1.68s     0.2399s    0.6665s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.67s  (  2.68s elapsed)
  GC      time    5.59s  (  5.73s elapsed)
  EXIT    time    0.02s  (  0.03s elapsed)
  Total   time    8.29s  (  8.43s elapsed)

  %GC     time      67.4%  (67.9% elapsed)

  Alloc rate    446,869,876 bytes per MUT second

  Productivity  32.6% of total user, 32.0% of total elapsed

Notice that 67% of your time is spent in GC! There is clearly something wrong. To find out what is wrong, we can run the program with heap profiling enabled (using +RTS -h), which produces the following figure:

First heap profile

So, you're leaking thunks. How does this happen? Looking at the code, the only time where a thunk is build up (recursively) in agg is when you do the addition. Making cv strict by adding a bang pattern thus fixes the issue:

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as V

histogram :: [(Int,Int)]
histogram = V.foldr' agg [] $ V.zip k v
 where
    n = 10000000
    c = 1000000
    k = V.generate n (\i -> i `div` c * c)
    v = V.generate n id
    agg kv [] = [kv]
    agg kv@(k,v) acc@((ck,!cv):as) -- Note the !
        | k == ck = (ck,cv+v):as
        | otherwise = kv:acc

main :: IO ()
main = print histogram

Output:

$ time ./improved +RTS -sstderr 
[(0,499999500000),(1000000,1499999500000),(2000000,2499999500000),(3000000,3499999500000),(4000000,4499999500000),(5000000,5499999500000),(6000000,6499999500000),(7000000,7499999500000),(8000000,8499999500000),(9000000,9499999500000)]
     672,063,056 bytes allocated in the heap
          94,664 bytes copied during GC
     160,028,816 bytes maximum residency (2 sample(s))
       1,464,176 bytes maximum slop
             155 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       992 colls,     0 par    0.03s    0.03s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.03s    0.03s     0.0161s    0.0319s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.24s  (  1.25s elapsed)
  GC      time    0.06s  (  0.06s elapsed)
  EXIT    time    0.03s  (  0.03s elapsed)
  Total   time    1.34s  (  1.34s elapsed)

  %GC     time       4.4%  (4.5% elapsed)

  Alloc rate    540,674,868 bytes per MUT second

  Productivity  95.5% of total user, 95.1% of total elapsed

./improved +RTS -sstderr  1,14s user 0,20s system 99% cpu 1,352 total

This is much better.


So now you could ask, why did the issue appear, even though you used seq? The reason for this is the seq only forces the first argument to be WHNF, and for a pair, (_,_) (where _ are unevaluated thunks) is already WHNF! Also, seq a a is the same as a, because it seq a b (informally) means: evaluate a before b is evaluated, so seq a a just means: evaluate a before a is evaluated, and that is the same as just evaluating a!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top