Write a parallel array Haskell expression once, run on CPUs & GPUs with repa and accelerate

https://stackoverflow.com/questions/23202145

07-07-2023
|

Question

Repa and Accelerate API Similarity

The Haskell repa library is for automatically parallel array computation on CPUs. The accelerate library is automatic data parallelism on GPUs. The APIs are quite similar, with identical representations of N-dimensional arrays. One can even switch between accelerate and repa arrays with fromRepa and toRepa in Data.Array.Accelerate.IO:

fromRepa :: (Shapes sh sh', Elt e) => Array A sh e -> Array sh' e
toRepa   :: Shapes sh sh'          => Array sh' e  -> Array A sh e

There are multiple backends for accelerate, including LLVM, CUDA and FPGA (see Figure 2 of http://www.cse.unsw.edu.au/~keller/Papers/acc-cuda.pdf). I've spotted a repa backend for accelerate, though the library doesn't appear to be maintained. Given that the repa and accelerate programming models are similar, I am hopeful that there is an elegant way of switching between them i.e. functions written once can be executed with repa's R.computeP or with one of accelerate's backends e.g. with the CUDA run function.

Two very similar functions: Repa and Accelerate on a Pumpkin

Take a simple image processing thresholding function. If a grayscale pixel value is less than 50, then it is set to 0, otherwise it retains its value. Here's what it does to a pumpkin:

The following code presents repa and accelerate implementations:

module Main where

import qualified Data.Array.Repa as R
import qualified Data.Array.Repa.IO.BMP as R
import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.IO as A
import qualified Data.Array.Accelerate.Interpreter as A

import Data.Word

-- Apply threshold over image using accelerate (interpreter)
thresholdAccelerate :: IO ()
thresholdAccelerate = do
  img <- either (error . show) id `fmap` A.readImageFromBMP "pumpkin-in.bmp"
  let newImg = A.run $ A.map evalPixel (A.use img)
  A.writeImageToBMP "pumpkin-out.bmp" newImg
    where
      -- *** Exception: Prelude.Ord.compare applied to EDSL types
      evalPixel :: A.Exp A.Word32 -> A.Exp A.Word32
      evalPixel p = if p > 50 then p else 0

-- Apply threshold over image using repa
thresholdRepa :: IO ()
thresholdRepa = do
  let arr :: IO (R.Array R.U R.DIM2 (Word8,Word8,Word8))
      arr = either (error . show) id `fmap` R.readImageFromBMP "pumpkin-in.bmp" 
  img <- arr
  newImg <- R.computeP (R.map applyAtPoint img)
  R.writeImageToBMP "pumpkin-out.bmp" newImg
  where
    applyAtPoint :: (Word8,Word8,Word8) -> (Word8,Word8,Word8)
    applyAtPoint (r,g,b) =
        let [r',g',b'] = map applyThresholdOnPixel [r,g,b]
        in (r',g',b')
    applyThresholdOnPixel x = if x > 50 then x else 0

data BackendChoice = Repa | Accelerate

main :: IO ()
main = do
  let userChoice = Repa -- pretend this command line flag
  case userChoice of
    Repa       -> thresholdRepa
    Accelerate -> thresholdAccelerate

Question: can I write this only once?

The implementations of thresholdAccelerate and thresholdRepa are very similar. Is there an elegant way to write array processing functions once, then opt for multicore CPUs (repa) or GPUs (accelerate) in a switch programmatically? I can think of choosing my import in accordance with whether I want CPU or GPU i.e. to import either Data.Array.Accelerate.CUDA or Data.Array.Repa to execute an action of type Acc a with:

run :: Arrays a => Acc a -> a

Or, to use a type class e.g. something roughly like:

main :: IO ()
main = do
  let userChoice = Repa -- pretend this is a command line flag
  action <- case userChoice of
    Repa       -> applyThreshold :: RepaBackend ()
    Accelerate -> applyThreshold :: CudaBackend ()
  action

Or is it the case that, for each parallel array function I wish to express for both CPUs and GPUs, I must implement it twice --- once with the repa library and again with the accelerate library?

Solution

The short answer is that, at the moment, you unfortunately need to write both versions.

However, we are working on CPU support for Accelerate, which will obviate the need for the Repa version of the code. In particular, Accelerate very recently gained a new LLVM-based backend that targets both GPUs and CPUs: https://github.com/AccelerateHS/accelerate-llvm

This new backend is still incomplete, buggy, and experimental, but we are planning to make it into a viable alternative to the current CUDA backend.

OTHER TIPS

I thought about this a year and few months ago while designing yarr. At that time there were serious issues with type families inference or something like this (I don't remember exactly) which prevented to implement such unifying wrapper of vector, repa, yarr, accelerate, etc. both efficiently and allowing not to write too many explicit type signatures, or implement it in principle (I don't remember).

That was GHC 7.6. I don't know if there meaningful improvements in GHC 7.8 in this field. Theoretically I didn't see any problems, thus we can expect such stuff someday, in short or long time, when GHC will be ready.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow