Poor performance with transpose and cumulative sum in Repa

https://stackoverflow.com/questions/6300428

haskell
repa

22-10-2019

Question

I have developed a cumulative sum function as defined below in the Haskell library Repa. However, I have run into an issue when combining this function with the transpose operation. All 3 of the following operations take well under a second:

cumsum $ cumsum $ cumsum x
transpose $ transpose $ transpose x
transpose $ cumsum x

However, if I write:

cumsum $ transpose x

performance degrades horrendously. While each individual operation in isolation takes well under a second on a 1920x1080 image, when combined they now take 30+ seconds...

Any ideas on what could be causing this? My gut tells me it has something to do with delayed arrays, not forcing at the right time, etc... But I do not have enough experience to track this down quite yet.

{-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}

import Data.Array.Repa as Repa

{-# INLINE indexSlice #-}
indexSlice :: (Shape sh, Elt a) => Int -> Array (sh :. Int) a -> (sh :. Int) -> a
indexSlice from arr (z :. ix) = arr `unsafeIndex` (z :. (ix + from))

{-# INLINE sliceRange #-}
sliceRange :: (Slice sh, Shape sh, Elt a) => Int -> Int -> Array (sh :. Int) a -> Array (sh :. Int) a
sliceRange from to arr = fromFunction (z :. (to - from + 1)) $ indexSlice from arr
    where (z :. _) = extent arr

{-# INLINE cumsum' #-}
cumsum' :: (Slice (SliceShape sh), Slice sh, Shape (FullShape sh), Shape (SliceShape sh), Elt a, Num a) =>
     Array (FullShape sh :. Int) a -> t -> (sh :. Int) -> a
cumsum' arr f (sh :. outer) = Repa.sumAll $ sliceRange 0 outer $ Repa.slice arr (sh :. All)

{-# INLINE cumsum #-}
cumsum :: (FullShape sh ~ sh, Slice sh, Slice (SliceShape sh), Shape sh, Shape (SliceShape sh), Elt a, Num a) =>
    Array (sh :. Int) a -> Array (sh :. Int) a
cumsum arr = Repa.force $ unsafeTraverse arr id $ cumsum' arr
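
As a reference for what the definitions above compute, here is a minimal sketch on plain lists rather than Repa arrays. It assumes a rank-2 input and uses a hypothetical helper name, cumsumRows, that is not part of the question's code. Each output element is the inclusive sum of the elements before it in the same row, which is what cumsum' produces by calling sumAll on a slice of the row.

```haskell
module Main where

import Data.List (transpose)

-- Inclusive running sum along each row of a rank-2 "array" given as a list
-- of rows. Hypothetical list model only; the question's cumsum works on
-- Repa arrays of any rank and sums along the innermost dimension.
cumsumRows :: Num a => [[a]] -> [[a]]
cumsumRows = map (scanl1 (+))

main :: IO ()
main = do
  let x = [[1, 2, 3], [4, 5, 6]] :: [[Int]]
  print (cumsumRows x)              -- [[1,3,6],[4,9,15]]
  print (cumsumRows (transpose x))  -- [[1,5],[2,7],[3,9]]
  print (transpose (cumsumRows x))  -- [[1,4],[3,9],[6,15]]
```

The last two lines correspond to the slow and fast compositions from the question; on lists they differ only in which dimension gets summed.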

Solution

From a library implementor's perspective, the way to debug this is to create a wrapper for the suspect operation, then look at the core code to see if fusion has worked.

-- Main.hs ---------------------------------------------------
import Solver
import Data.Array.Repa.IO.BMP

main 
 = do   Right img       <- readImageFromBMP "whatever.bmp"
        print $ cumsumBMP img

-- Solver.hs --------------------------------------------------
{-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}
module Solver (cumsumBMP) where
import Data.Array.Repa  as Repa
import Data.Word

{- all your defs -}

{-# NOINLINE cumsumBMP #-}
cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
cumsumBMP img = cumsum $ transpose img

I've put the "solver" code in a separate module, so we only have to wade through the core code for the definitions we care about.

Compile like:

touch Solver.hs ; ghc -O2 --make Main.hs \
 -ddump-simpl -dsuppress-module-prefixes -dsuppress-coercions  > dump

Go to the definition of cumsumBMP and search for the letrec keyword. Searching for letrec is a quick way to find the inner loops.

Not too far down I see this: (slightly reformatted)

case gen_a1tr
of _ {
  GenManifest vec_a1tv ->
    case sh2_a1tc  `cast` ... of _ { :. sh3_a1iu  sh4_a1iv ->
    case ix'_a1t9  `cast` ... of _ { :. sh1'_a1iz sh2'_a1iA ->
    case sh3_a1iu  `cast` ... of _ { :. sh5_X1n0  sh6_X1n2 ->
    case sh1'_a1iz `cast` ... of _ { :. sh1'1_X1n9 sh2'1_X1nb ->
    case sh5_X1n0             of _ { :. sh7_X1n8   sh8_X1na ->
    ...
    case sh2'1_X1nb           of _ { I# y3_X1nO ->
    case sh4_a1iv             of _ { I# y4_X1nP ->
    case sh2'_a1iA            of _ { I# y5_X1nX ->
    ...
    let { x3_a1x6 :: Int# [LclId]
      x3_a1x6 =
        +#
          (*#
             (+#
                (*#
                   y1_a1iM
                   y2_X1nG)
                y3_X1nO)
             y4_X1nP)
          y5_X1nX } in
    case >=#
           x3_a1x6
           0
    of ...

Disaster! The x3_a1x6 binding is clearly doing some useful work (multiplications, additions and suchlike) but it's wrapped in a long series of unboxing operations that are also executed for every loop iteration. What's worse is that it's unboxing the length and width (shape) of the array at every iteration, and this information will always be the same. GHC should really float these case expressions out of the loop, but it doesn't yet. This is an instance of Issue #4081 on the GHC trac, which hopefully will be fixed sometime soon.
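
To make the arithmetic in x3_a1x6 easier to read, here is a small sketch of a row-major linear index for a rank-3 array in plain Haskell. The name linearIndex3 is hypothetical, and the mapping of y2 and y4 to extents and y1, y3 and y5 to index components is my reading of the dump, not something the dump states.

```haskell
module Main where

-- Row-major linear index for a rank-3 array.
-- 'rows' and 'cols' are the two innermost extents; (i, j, k) is the position.
-- The nesting mirrors the dump: (y1 * y2 + y3) * y4 + y5.
linearIndex3 :: Int -> Int -> (Int, Int, Int) -> Int
linearIndex3 rows cols (i, j, k) = (i * rows + j) * cols + k

main :: IO ()
main = do
  print (linearIndex3 3 4 (0, 0, 0))  -- 0
  print (linearIndex3 3 4 (1, 2, 3))  -- 23
```

Only the index components change from one iteration to the next. The extents are the same every time, which is why unboxing them inside the loop is wasted work.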

The workaround is to apply deepSeqArray to the incoming array. This places a demand on its value at the top level (outside the loop), which lets GHC know it's OK to move the case matches further up. For a function like cumsumBMP, we also expect the incoming array to already be manifest, so we can add an explicit case match for this:

{-# NOINLINE cumsumBMP #-}
cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
cumsumBMP img@(Array _ [Region RangeAll (GenManifest _)])
  = img `deepSeqArray` cumsum $ transpose img
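
The same idea can be sketched in plain Haskell, without Repa. The Grid type and rowPrefixSum below are hypothetical stand-ins, and the example only illustrates demanding a loop-invariant value before the loop starts. It does not reproduce the GHC issue, and GHC's strictness analysis may well handle a case this small on its own.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- Hypothetical stand-in for an array with a boxed extent; not a Repa type.
data Grid = Grid
  Int           -- number of columns (lazy field, may be a thunk)
  (Int -> Int)  -- element lookup by linear index

-- Sum elements 0..n of row r. The bang on 'cols' forces the extent once,
-- at the top of the function, so the loop 'go' never has to evaluate it.
rowPrefixSum :: Grid -> Int -> Int -> Int
rowPrefixSum (Grid !cols at) r n = go 0 0
  where
    go !acc !k
      | k > n     = acc
      | otherwise = go (acc + at (r * cols + k)) (k + 1)

main :: IO ()
main = do
  -- The "array" here is just the identity on linear indices.
  let g = Grid 4 id
  print (rowPrefixSum g 2 3)  -- 8 + 9 + 10 + 11 = 38
```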

Compiling again, the inner loop now looks much better:

letrec {
$s$wfoldlM'_loop_s2mW [...]
  :: Int# -> Word# -> Word# [...]
$s$wfoldlM'_loop_s2mW =
  \ (sc_s2mA :: Int#) (sc1_s2mB :: Word#) ->
    case <=# sc_s2mA a_s2ji of _ {
      False -> sc1_s2mB;
      True ->
        $s$wfoldlM'_loop_s2mW
          (+# sc_s2mA 1)
          (narrow8Word#
             (plusWord#
                sc1_s2mB
                (indexWord8Array#
                   rb3_a2gZ
                   (+#
                      rb1_a2gX
                      (+#
                         (*#
                            (+#
                               (*#
                                  wild19_X1zO
                                  ipv1_X1m5)
                               sc_s2mA)
                            ipv2_X1m0)
                         wild20_X1Ct)))))
    }; } in

That's a tight, tail-recursive loop that only uses primitive operations. Provided you compile with -fllvm -optlo-O3, there's no reason that won't run as fast as an equivalent C program.
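
One thing the narrow8Word# call in that loop shows is that the accumulator stays a Word8, so the running sum wraps around modulo 256. A minimal sketch of that behaviour in plain Haskell follows. The names sumWord8 and sumWide are hypothetical, and widening the accumulator is only a suggestion for when wraparound is not wanted, not something the code above does.

```haskell
module Main where

import Data.List (foldl')
import Data.Word (Word8)

-- Strict left fold with a Word8 accumulator: same element type as the
-- loop above, so the result wraps around modulo 256.
sumWord8 :: [Word8] -> Word8
sumWord8 = foldl' (+) 0

-- Same fold with each element widened to Int first, so nothing wraps
-- for inputs of realistic size.
sumWide :: [Word8] -> Int
sumWide = foldl' (\acc w -> acc + fromIntegral w) 0

main :: IO ()
main = do
  print (sumWord8 [1, 2, 3])    -- 6
  print (sumWord8 [200, 100])   -- 44, since 300 wraps modulo 256
  print (sumWide  [200, 100])   -- 300
```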

There's a slight hiccup when running it though:

desire:tmp benl$ ./Main 
Main: Solver.hs:(50,1)-(51,45): Non-exhaustive patterns in function cumsumBMP

This just reminds us that we need to force the array before calling cumsumBMP.

-- Main.hs ---------------------------------------------------
...
import Data.Array.Repa as Repa
main 
 = do   Right img       <- readImageFromBMP "whatever.bmp"
        print $ cumsumBMP $ Repa.force img

In summary:

You need to add some deepSeqArray and pattern matching goop to your top-level functions to work around a current infelicity in GHC. This is demonstrated by the final version of the cumsumBMP function above. If you want GHC HQ to fix this soon, add yourself to the cc list of Issue #4081 on the GHC trac. Repa programs will be much prettier when this is fixed.
You don't need to add the goop to every function. In this example I didn't need to touch indexSlice and friends. The general rule is to add the goop to functions that use force, fold or sumAll. These functions instantiate the actual loops that operate over the array data; that is, they convert a delayed array to a manifest value.
The performance of a piece of Repa code is determined as much by the context in which it's used as by the code itself. If you pass your top-level functions delayed arrays then they will run very slowly. There is more discussion of this in The Repa Tutorial.
BMP files read with the repa-io library aren't pre-forced, so you need to force them before use. This is probably the wrong default, so I'll change it in the next version.

Licensed under: CC-BY-SA with attribution
