Question

I have a problem where my code is creating too many thunks (over 270MB) and consequently spends way too much time (over 70%) in GC when grouping values by key. I was wondering what the best way to group values by key.

The problem is that I have keys and values represented by vectors and I want to group the values by keys preserving the order. For example:

Input:

keys = 1 2 4 3 1 3 4 2 1 
vals = 1 2 3 4 5 6 7 8 9

Output:

1 = 1,5,9
2 = 2,8
3 = 4,6
4 = 3,7

Compile options:

ghc --make -03 -fllvm histogram.hs

In imperative programming, I would just use a multimap so I decided to use a hash table and where the associated value is [Int] to store the grouped values. I am hoping there is a much better FP solution.

{-# LANGUAGE BangPatterns #-}
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V

n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
 where
    k = V.generate n (\i -> i `mod` 1000)
    v = V.generate n (\i -> i)

ts :: V.Vector (Int,Int) -> M.HashMap Int Int
ts vec =
    V.foldl' (\ht (k, v) -> M.insertWith (+) k v ht) M.empty vec

ts2 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts2 vec =
    V.foldl' (\ht (!k, !v) -> M.insertWith (++) k [v] ht) M.empty vec

main :: IO ()
main = ts2 kv `seq` putStrLn "done"

Here's what spits out at runtime:

   3,117,102,992 bytes allocated in the heap
   1,847,205,880 bytes copied during GC
     324,159,752 bytes maximum residency (12 sample(s))
       6,502,224 bytes maximum slop
             658 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      5991 colls,     0 par    0.58s    0.58s     0.0001s    0.0003s
  Gen  1        12 colls,     0 par    0.69s    0.69s     0.0577s    0.3070s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.45s  (  0.45s elapsed)
  GC      time    1.27s  (  1.27s elapsed)
  EXIT    time    0.03s  (  0.03s elapsed)
  Total   time    1.75s  (  1.75s elapsed)

  %GC     time      72.7%  (72.8% elapsed)

  Alloc rate    6,933,912,935 bytes per MUT second

  Productivity  27.3% of total user, 27.3% of total elapsed

You can see it spends a lot of time in GC so I decided to use bangs to make the list concatenation strict. I guess the ++ is quite expensive too but don't know a workaround around this.

Was it helpful?

Solution 2

The bulk of your problem is due to (++) leading to "lots of useless copying of intermediate lists", as Carl puts it in his answer. Having played with a few different approaches at replacing (++), I got the best results thus far by switching to Data.IntMap.Strict from containers (just to take advantage of the less stern API - I don't know which implementation is more efficient per se) and using its alter function to prepend the vector elements without creating singleton lists:

import qualified Data.IntMap.Strict as M
import qualified Data.Vector.Unboxed as V

n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
 where
    k = V.generate n (\i -> i `mod` 1000)
    v = V.generate n (\i -> i)

ts2 :: V.Vector (Int,Int) -> M.IntMap [Int]
ts2 vec =
    V.foldl' (\ht (k, v) -> M.alter (prep v) k ht) M.empty vec
    where
    prep x = Just . maybe [x] (x:)

main :: IO ()
main = print $ M.foldl' (+) 0 $ M.map length $ ts2 kv

The second best solution was using

\ht (k, v) -> M.insertWith (\(x:_) -> (x :)) k [v] ht

as the fold operator. That works with both Data.IntMap.Strict and Data.HashMap.Strict, with similar results performance-wise.

N.B.: Note that in all cases, your original implementation included, the vector elements are being prepended, rather than appended, to the lists. Your problems would be much more serious if you were appending the elements, as repeatedly appending to an empty list with (++) is quadratic in the number of elements.

OTHER TIPS

Those strictness annotations are useless. They're forcing only the first constructor of the lists.

Even worse, it appears you're attempting to left fold (++), which is never a good idea. It results in lots of useless copying of intermediate lists, even when it's made fully strict.

You should fold to a [Int] -> [Int] value, instead. That will get rid of the multiple useless allocations. I'm on mobile, so I can't really provide full example code. The main idea is that you change the loop to M.insertWith (.) k (v:) and then map ($ [] ) over the values in the HashMap after the fold.

I tried to run your code on my host and I am not able to reproduce your profile:

runhaskell test8.hs +RTS -sstderr
done
     120,112 bytes allocated in the heap
       3,520 bytes copied during GC
      68,968 bytes maximum residency (1 sample(s))
      12,952 bytes maximum slop
           1 MB total memory in use (0 MB lost due to fragmentation)

                                Tot time (elapsed)  Avg pause  Max pause
 Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
 Gen  1         1 colls,     0 par    0.00s    0.09s     0.0909s    0.0909s

 INIT    time    0.00s  (  0.01s elapsed)
 MUT     time    0.00s  ( 29.21s elapsed)
 GC      time    0.00s  (  0.09s elapsed)
 EXIT    time    0.00s  (  0.09s elapsed)
 Total   time    0.01s  ( 29.40s elapsed)

 %GC     time       5.7%  (0.3% elapsed)

 Alloc rate    381,307,936 bytes per MUT second

 Productivity  91.1% of total user, 0.0% of total elapsed

Can you pls outline some more detail about how you are testing the code? If you are using ghci then

$ ghci -fobject-code

we probably need to use -fobject-code to eliminate any space leaks from the ghci. If you have already tried the ghci option, assuming that you are using ghci, I will edit my answer. At this point, I would like to reproduce the issue you are seeing.

Update: @ duplode : Thank you for the pointers. I am going to delete the previous output no one objects to it as it is misleading.

I have been able to reduce the gc overhead by a bit using one of the following options. I am getting some benefits but the overhead is still in the 49 - 50 % range:

ts3 :: V.Vector (Int, Int) -> M.HashMap Int [Int]
ts3 vec =
V.foldl (\ht (!k, !v) -> 
    let
        element = M.lookup k ht in
    case element of
        Nothing -> M.insert k [v] ht
        Just aList -> M.insert k (v:aList) ht) M.empty vec
ts4 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts4 vec = 
    let initMap = V.foldl (\ht (!k,_) -> M.insert k [] ht) M.empty vec
in
    V.foldl (\ht (!k, !v) -> M.adjust(\x -> v:x) k ht) initMap vec

The adjust seemed a bit better, but they results seem similar to a straight lookup. With ts4 using adjust:

calling ts4 done.
 3,838,059,320 bytes allocated in the heap
 2,041,603,344 bytes copied during GC
 377,412,728 bytes maximum residency (6 sample(s))
   7,725,944 bytes maximum slop
         737 MB total memory in use (0 MB lost due to fragmentation)

                                Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7260 colls,     0 par    1.32s    1.45s     0.0002s    0.0013s
  Gen  1         6 colls,     0 par    0.88s    1.40s     0.2328s    0.9236s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.18s  (  2.21s elapsed)
  GC      time    2.19s  (  2.85s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    0.00s  (  0.00s elapsed)
  EXIT    time    0.01s  (  0.07s elapsed)
  Total   time    4.38s  (  5.13s elapsed)

  %GC     time      50.0%  (55.5% elapsed)

  Alloc rate    1,757,267,879 bytes per MUT second

 Productivity  50.0% of total user, 42.7% of total elapsed

Using the simple lookup/update (imperative style of updating a map)

calling ts3 done.
3,677,137,816 bytes allocated in the heap
2,040,053,712 bytes copied during GC
 395,867,512 bytes maximum residency (6 sample(s))
   7,326,104 bytes maximum slop
         769 MB total memory in use (0 MB lost due to fragmentation)

                                Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6999 colls,     0 par    1.35s    1.51s     0.0002s    0.0037s
  Gen  1         6 colls,     0 par    1.06s    2.16s     0.3601s    1.3175s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.89s  (  2.07s elapsed)
  GC      time    2.41s  (  3.67s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    0.00s  (  0.00s elapsed)
  EXIT    time    0.01s  (  0.08s elapsed)
  Total   time    4.31s  (  5.82s elapsed)

  %GC     time      55.9%  (63.0% elapsed)

  Alloc rate    1,942,816,558 bytes per MUT second

  Productivity  44.1% of total user, 32.6% of total elapsed

I am interested in finding out as to how to reduce the time for lookup as show in the profile output below:

COST CENTRE   MODULE  %time %alloc

ts3.\         Main     54.1   91.4
ts3.\.element Main     19.0    2.9
ts3           Main     11.0    2.9
kv.k          Main      6.5    1.4   
kv.v          Main      5.2    1.4
kv.k.\        Main      4.0    0.0


                                                         individual     inherited
COST CENTRE        MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN               MAIN                     72           0    0.0    0.0   100.0  100.0
 main              Main                    158           0    0.0    0.0     0.0    0.0
 CAF:main          Main                    143           0    0.0    0.0    84.2   97.1
  main             Main                    144           1    0.0    0.0    84.2   97.1
   ts3             Main                    145           1   11.0    2.9    84.2   97.1
    ts3.\          Main                    156     5000000   54.1   91.4    73.2   94.3
     ts3.\.element Main                    157     5000000   19.0    2.9    19.0    2.9
 CAF:kv            Main                    142           0    0.0    0.0     0.0    0.0

Code

    -- ghc -O2 --make test8.hs -prof -auto-all -caf-all -fforce-recomp +RTS
    -- ./test8 +RTS -p

    {-# LANGUAGE BangPatterns #-}
    import qualified Data.HashMap.Strict as M
    import qualified Data.Vector.Unboxed as V

    n :: Int
    n = 5000000

    kv :: V.Vector (Int,Int)
    kv = V.zip (k) (v)
     where
        k = V.generate n (\i -> i `mod` 1000)
        v = V.generate n (\i -> i)

    ts :: V.Vector (Int,Int) -> M.HashMap Int Int
    ts vec =
        V.foldl' (\ht (k, v) -> M.insertWith (+) k v ht) M.empty vec



    ts2 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
    ts2 vec =
        V.foldl (\ht (!k, !v) -> M.insertWith (++) k [v] ht) M.empty vec

    ts3 :: V.Vector (Int, Int) -> M.HashMap Int [Int]
    ts3 vec =
        V.foldl (\ht (!k, !v) -> 
            let
                element = M.lookup k ht in
            case element of
                Nothing -> M.insert k [v] ht
                Just aList -> M.insert k (v:aList) ht) M.empty vec
    ts4 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
    ts4 vec = 
            let initMap = V.foldl (\ht (!k,_) -> M.insert k [] ht) M.empty vec
        in
            V.foldl (\ht (!k, !v) -> M.adjust(\x -> v:x) k ht) initMap vec


    main :: IO ()
    main = ts3 kv `seq` putStrLn "calling ts3 done."

    main1 = do
                if x == y then
                    putStrLn "Algos Match"
                else
                    putStrLn "Error"
            where
                x = ts2 kv
                y = ts4 kv
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top