Parallel Repa code doesn't create sparks

https://stackoverflow.com/questions/16097418

04-04-2022

Question

I'm writing code to do a subset product: it takes a list of elements and a list of indicator variables (of the same length). The product is computed in a tree, which is crucial to our application. Each product is expensive, so my goal was to compute each level of the tree in parallel, evaluating consecutive levels in sequence. Thus there isn't any nested parallelism going on.

I only have repa code in ONE function, near the top level of my overall code. Note that subsetProd is not monadic.

The steps:

1. chunk up the lists into pairs (no parallelism)
2. zip the chunked lists (no parallelism)
3. map the product function over this list (using Repa map), creating a Delayed array
4. call computeP to evaluate the map in parallel
5. convert the Repa result back to a list
6. make a recursive call (on lists half the size of the inputs)

The code:

{-# LANGUAGE TypeOperators, FlexibleContexts, BangPatterns #-}

import System.Random
import System.Environment (getArgs)
import Control.Monad.State
import Control.Monad.Identity (runIdentity)

import Data.Array.Repa as Repa
import Data.Array.Repa.Eval as Eval
import Data.Array.Repa.Repr.Vector

force :: (Shape sh) => Array D sh e -> Array V sh e
force = runIdentity . computeP

chunk :: [a] -> [(a,a)]
chunk [] = []
chunk (x1:x2:xs) = (x1,x2):(chunk xs)

slow_fib :: Int -> Integer
slow_fib 0 = 0
slow_fib 1 = 1
slow_fib n = slow_fib (n-2) + slow_fib (n-1) 

testSubsetProd :: Int -> Int -> IO ()
testSubsetProd size seed = do
    let work = do
            !flags <- replicateM size (state random)
            !values <- replicateM size (state $ randomR (1,10))
            return $ subsetProd values flags
        value = evalState work (mkStdGen seed)
    print value

subsetProd :: [Int] -> [Bool] -> Int
subsetProd [!x] _ = x
subsetProd !vals !flags = 
    let len = (length vals) `div` 2
        !valpairs = Eval.fromList (Z :. len) $ chunk vals :: (Array V (Z :. Int) (Int, Int))
        !flagpairs = Eval.fromList (Z :. len) $ chunk flags :: (Array V (Z :. Int) (Bool, Bool))
        !prods = force $ Repa.zipWith mul valpairs flagpairs
        mul (!v0,!v1) (!f0,!f1)
            | (not f0) && (not f1) = 1
            | (not f0) = v0+1
            | (not f1) = v1+1
            | otherwise = fromInteger $ slow_fib ((v0*v1) `mod` 35)
    in subsetProd (toList prods) (Prelude.map (uncurry (||)) (toList flagpairs))

main :: IO ()
main = do
  args <- getArgs
  let [numleaves, seed] = Prelude.map read args :: [Int]
  testSubsetProd numleaves seed

The entire program is compiled with

ghc -Odph -rtsopts -threaded -fno-liberate-case -funfolding-use-threshold1000 -funfolding-keeness-factor1000 -fllvm -optlo-O3

per these instructions, on GHC 7.6.2 x64.

I ran my program (Subset) using

$> time ./Test 4096 4 +RTS -sstderr -N4

8 seconds later:

672,725,819,784 bytes allocated in the heap
 11,312,267,200 bytes copied during GC
   866,787,872 bytes maximum residency (49 sample(s))
   433,225,376 bytes maximum slop
        2360 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
  Gen  0     1284212 colls, 1284212 par   174.17s   53.20s     0.0000s    0.0116s
  Gen  1        49 colls,    48 par   13.76s    4.63s     0.0946s    0.6412s

  Parallel GC work balance: 16.88% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  497.80s  (448.38s elapsed)
  GC      time  187.93s  ( 57.84s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  685.73s  (506.21s elapsed)

  Alloc rate    1,351,400,138 bytes per MUT second

  Productivity  72.6% of total user, 98.3% of total elapsed

gc_alloc_block_sync: 8670031
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 571398

My code gets slower as I increase the -N parameter (7.628 seconds for -N1, 7.891 seconds for -N2, 8.659 seconds for -N4), and I'm getting 0 sparks created, which seems like a prime suspect as to why I'm not getting any parallelism. Compiling with a whole slew of optimizations helps with the runtime, but not with the parallelism.

Threadscope confirms that no serious work is being done on three HECs, but the garbage collector seems to be using all 4 HECs.

[Figure: ThreadScope trace for the -sstderr run above]

So why isn't Repa making any sparks? My product tree has 64 leaves, so even if Repa made a spark for every internal node, there should be ~63 sparks. I feel like it could have something to do with my use of the ST monad encapsulating the parallelism, though I'm not quite sure why this would cause an issue. Perhaps sparks can only be created in an IO monad?

If this is the case, does anyone have an idea of how I could perform this tree product where each level is done in parallel (without resulting in nested parallelism, which seems unnecessary for my task)? More generally, perhaps there is a better way to parallelize the tree product or make better use of Repa.

Bonus points for explaining why the runtime increases as I increase the -N parameter, even when no sparks are created.

EDIT: I changed the code example above to be a compiling example of my problem. The program flow almost perfectly matches my real code: I randomly choose some inputs, and then do a subset product on them. I am now using the identity monad. I have tried lots of small changes to my code: inlining or not, bang patterns or not, using two Repa arrays with a Repa zipWith versus zipping the lists sequentially and using a Repa map, and so on. None of these helped at all.

Even if I'm running into this problem in my example code, my real program is much larger.

Solution

Why is there no parallelism?

The main reason there is no parallelism (at least in your now simplified and working program) is that you're using computeP on an array of V representation, and normal vectors aren't strict in their elements. computeP fills the array with unevaluated thunks, so you aren't actually doing any real work in parallel. The easiest fix is to use an unboxed U array as the result, by changing force to this definition:

force :: (Shape sh, Unbox e) => Array D sh e -> Array U sh e
force a = runIdentity (computeP a)

I do recall that in your original code you claimed you're working with a complicated datatype that isn't unboxed. But is it really impossible to make it so? Perhaps you can extract the data you actually need into some unboxable representation? Or make the type an instance of the Unbox class? If not, then you can also use the following variant of force that works for a V-array:

import Control.DeepSeq (NFData(..))

...

force :: (Shape sh, NFData e) => Array D sh e -> Array V sh e
force a = runIdentity $ do
  r  <- computeP a
  !b <- computeUnboxedP (Repa.map rnf r)
  return r

The idea here is that we first compute the V-array structure, and then we compute a U-array of () type from it by mapping rnf over the array. The resulting array is uninteresting, but each of the V-array's elements will be forced in the process¹.

Either of these changes brings runtime for a problem size of 4096 from ~9 down to ~3 seconds with -N4 on my machine.
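To see the underlying laziness in isolation, here is a minimal sketch that uses only base and deepseq. It does not use Repa at all: the list `boxed` is a hypothetical stand-in for a boxed V array, and the `error` elements stand in for expensive unevaluated products. Building or traversing the structure leaves the elements untouched, while `rnf` evaluates each one, which is what the second variant of force relies on.

```haskell
import Control.DeepSeq (rnf)
import Control.Exception (ErrorCall, evaluate, try)

-- A boxed container whose elements are thunks that fail if anyone forces them.
-- (Hypothetical stand-in for the elements of a V array.)
boxed :: [Int]
boxed = [error "element forced", error "element forced"]

-- Walking the structure (here: taking the length) never touches the elements.
structureOnly :: IO Bool
structureOnly = do
  r <- try (evaluate (length boxed)) :: IO (Either ErrorCall Int)
  return (either (const False) (== 2) r)

-- rnf evaluates every element, so this time the error thunks are hit.
elementsForced :: IO Bool
elementsForced = do
  r <- try (evaluate (rnf boxed)) :: IO (Either ErrorCall ())
  return (either (const True) (const False) r)

main :: IO ()
main = do
  structureOnly >>= print   -- True
  elementsForced >>= print  -- True
```

In the original program the same thing happens with slow_fib: the parallel computeP only stores the thunks, and they are evaluated later, sequentially, when the list is consumed.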

In addition, I think it's strange that you convert between lists and arrays in every step. Why not make subsetProd take two arrays? Also, at least for the values, using an intermediate V array for the pairs seems unnecessary; you could just as well use a D array. But in my experiments these changes didn't have a significant beneficial effect on runtime.
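As a rough illustration of the array-only variant, the following sketch keeps both inputs as unboxed Repa arrays and halves them level by level. It is not the code used in the experiments mentioned above. The names `level`, `pairAt` and `subsetProdA` are made up for this example, `mul` is a cheap stand-in for the expensive product, and the input length is assumed to be a power of two.

```haskell
{-# LANGUAGE BangPatterns, TypeOperators #-}
import Data.Array.Repa (Array, DIM1, U, Z (..), (:.) (..), (!))
import qualified Data.Array.Repa as R
import Data.Functor.Identity (runIdentity)

-- Cheap stand-in for the question's product (assumption: no slow_fib here).
mul :: (Int, Int) -> (Bool, Bool) -> Int
mul (v0, v1) (f0, f1)
  | not f0 && not f1 = 1
  | not f0           = v0 + 1
  | not f1           = v1 + 1
  | otherwise        = (v0 * v1) `mod` 35

-- Elements 2i and 2i+1 of a one-dimensional array.
pairAt :: R.Source r e => Array r DIM1 e -> Int -> (e, e)
pairAt arr i = (arr ! (Z :. 2 * i), arr ! (Z :. 2 * i + 1))

-- One tree level: build delayed arrays of half the size, then evaluate
-- each of them in parallel into an unboxed array.
level :: Array U DIM1 Int -> Array U DIM1 Bool
      -> (Array U DIM1 Int, Array U DIM1 Bool)
level !vals !flags = runIdentity $ do
  let Z :. n = R.extent vals
      half   = Z :. (n `div` 2)
      prods  = R.fromFunction half (\(Z :. i) -> mul (pairAt vals i) (pairAt flags i))
      ors    = R.fromFunction half (\(Z :. i) -> uncurry (||) (pairAt flags i))
  !vals'  <- R.computeUnboxedP prods
  !flags' <- R.computeUnboxedP ors
  return (vals', flags')

-- Levels are evaluated one after another; the case forces a whole level
-- before the next one starts, which avoids nested parallel computations.
subsetProdA :: Array U DIM1 Int -> Array U DIM1 Bool -> Int
subsetProdA vals flags
  | R.size (R.extent vals) <= 1 = vals ! (Z :. 0)
  | otherwise = case level vals flags of
      (vals', flags') -> subsetProdA vals' flags'

main :: IO ()
main = do
  let n     = 8
      vals  = R.fromListUnboxed (Z :. n) [1 .. n]
      flags = R.fromListUnboxed (Z :. n) (take n (cycle [True, False]))
  print (subsetProdA vals flags)
```

Because the results are U arrays, every element is fully evaluated inside computeUnboxedP, so no separate forcing step is needed.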

Why are there no sparks?

Repa never creates sparks. Haskell has many different approaches to parallelism, and sparks are one particular mechanism that has special support in the run-time system. However, only some libraries, for example the parallel package and one particular scheduler of the monad-par package, actually make use of that mechanism. Repa does not. It uses forkIO, i.e., threads, internally, but provides a pure interface to the outside. So the absence of sparks is in itself nothing to worry about.
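The difference between the two mechanisms can be sketched with base alone. This is only an illustration, not how Repa is implemented (its internal scheduler is more elaborate); `sumWithSpark`, `sumWithThread` and `slowFib` are names invented for this example. The first function creates a spark via `par`, which is the kind of event counted in the SPARKS line of the -s output. The second forks an ordinary thread and hides it behind a pure interface, so it never shows up there.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import GHC.Conc (par, pseq)
import System.IO.Unsafe (unsafePerformIO)

-- Deliberately naive Fibonacci, standing in for expensive work.
slowFib :: Int -> Integer
slowFib n
  | n < 2     = toInteger n
  | otherwise = slowFib (n - 2) + slowFib (n - 1)

-- Spark-based: `par` registers `a` as a spark that an idle capability may
-- pick up while the current thread evaluates `b`.
sumWithSpark :: Int -> Int -> Integer
sumWithSpark x y = a `par` (b `pseq` (a + b))
  where
    a = slowFib x
    b = slowFib y

-- Thread-based: one half of the work runs in a forked thread and is handed
-- back through an MVar. No spark is created at any point.
sumWithThread :: Int -> Int -> Integer
sumWithThread x y = unsafePerformIO $ do
  box <- newEmptyMVar
  _   <- forkIO (evaluate (slowFib x) >>= putMVar box)
  b   <- evaluate (slowFib y)
  a   <- takeMVar box
  return (a + b)

main :: IO ()
main = do
  print (sumWithSpark 20 21)   -- 17711
  print (sumWithThread 20 21)  -- 17711
```

Compiled with -threaded and run with +RTS -N2 -s, only the first call should contribute to the SPARKS count.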

¹ I originally had no idea how to do that, so I asked Ben Lippmeier, the author of Repa. Thanks a lot to Ben for pointing out to me the option of mapping rnf to produce a different array, and the fact that there's an Unbox instance for ().

Licensed under: CC-BY-SA with attribution