Optimizing Perlin noise in Haskell

Question 1

This code appears to be mostly computation-bound. It can be improved a little bit, but not by much unless there's a way to use fewer array lookups and less arithmetic.

There are two useful tools for measuring performance: profiling and code dumps. I added an SCC annotation to perlin3 so that it would show up in the profile. Then I compiled with ghc -O2 -fforce-recomp -ddump-simpl -prof -auto. The -ddump-simpl flag prints the simplified Core that GHC generates.
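As a minimal sketch of what such an annotation looks like, an SCC pragma wraps the expression whose cost should be attributed to a named cost centre. The function noise below is a hypothetical stand-in, not the real perlin3, and the pragma only has an effect when the program is built with -prof.

```haskell
module Main where

-- Hypothetical stand-in for perlin3; the real function is not reproduced here.
noise :: Double -> Double
noise x = {-# SCC "noise" #-} sin x * cos x

main :: IO ()
main =
  -- Call the annotated function often enough to register in a profile.
  print (sum (map noise [0, 0.001 .. 100]))
```

Running the profiled binary with +RTS -p writes a .prof file in which the "noise" cost centre appears as its own entry.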

Profiling: On my computer, it takes 0.60 seconds to run the program, and about 20% of execution time (0.12 seconds) is spent in perlin3 according to the profile. Note that the precision of my profile info is about +/-3%.

Simplifier output: The simplifier produces fairly clean code. perlin3 gets inlined into pixelRenderer, so that's the part of the output you want to look at. Most of the code consists of unboxed array reads and unboxed arithmetic. To improve performance, we want to eliminate some of this arithmetic.

An easy change is to eliminate the run-time checks on SomeFraction (which doesn't appear in your question, but is part of the code that you uploaded). This reduces the program's execution time to 0.56 seconds.

-- someFraction t | 0 <= t, t < 1 = SomeFraction t
someFraction t = SomeFraction t
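
To make the trade-off concrete, here is a small sketch of the checked and unchecked constructors side by side. The newtype definition and the otherwise branch are assumptions made for illustration; the uploaded code may define SomeFraction differently.

```haskell
module Main where

-- Assumed representation: a wrapper around a Double in [0, 1).
newtype SomeFraction = SomeFraction Double
  deriving (Show, Eq)

-- Checked version: pays for two comparisons on every call.
someFractionChecked :: Double -> SomeFraction
someFractionChecked t
  | 0 <= t, t < 1 = SomeFraction t
  | otherwise     = error "someFractionChecked: argument outside [0, 1)"

-- Unchecked version: trusts the caller to pass a value in range.
someFraction :: Double -> SomeFraction
someFraction = SomeFraction

main :: IO ()
main = do
  print (someFractionChecked 0.25)
  print (someFraction 0.25)
```

The unchecked version is only safe if every call site already guarantees the range, so it trades a safety net for fewer comparisons in the inner loop.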

Next, there are several array lookups that show up in the simplifier like this:

                 case GHC.Prim.indexWord8Array#
                        ipv3_s23a
                        (GHC.Prim.+#
                           ipv1_s21N
                           (GHC.Prim.word2Int#
                              (GHC.Prim.and#
                                 (GHC.Prim.narrow8Word#
                                    (GHC.Prim.plusWord# ipv5_s256 (__word 1)))
                                 (__word 255))))

The primitive operation narrow8Word# truncates a machine word to 8 bits; it shows up here because the index arithmetic is done on Word8 values. We can get rid of this coercion by using Int instead of Word8 in the definition of next.

next :: Permutation -> Int -> Int
next (Permutation !v) !idx'
  = fromIntegral $ v `V.unsafeIndex` (fromIntegral idx' .&. 0xFF)

This reduces the program's execution time to 0.54 seconds. Considering just the time spent in perlin3, the execution time has fallen (roughly) from 0.12 to 0.06 seconds. Although it's hard to measure where the rest of the time is going, it's most likely spread out among the remaining arithmetic and array accesses.
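
The switch from Word8 to Int is safe because masking an Int with 0xFF gives the same index as letting Word8 arithmetic wrap around. The following sketch checks that for every byte value; wrapWord8 and maskInt are hypothetical helper names used only for this illustration.

```haskell
module Main where

import Data.Bits ((.&.))
import Data.Word (Word8)

-- Word8 addition wraps modulo 256 (this is what narrow8Word# implements).
wrapWord8 :: Word8 -> Int
wrapWord8 w = fromIntegral (w + 1)

-- The Int version gets the same wrap-around from the explicit mask.
maskInt :: Int -> Int
maskInt i = (i + 1) .&. 0xFF

main :: IO ()
main =
  -- Compare both versions on every possible byte value.
  print (all (\i -> wrapWord8 (fromIntegral i) == maskInt i) [0 .. 255])
```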

Question 2

On my machine, the reference code with Heatsink's optimisations takes 0.19 seconds.

Firstly, I moved from JuicyPixels to yarr and yarr-image-io, compiling with my favourite flags, -Odph -rtsopts -threaded -fno-liberate-case -funbox-strict-fields -fexpose-all-unfoldings -funfolding-keeness-factor1000 -fsimpl-tick-factor=500 -fllvm -optlo-O3 (they are given here):

import Data.Yarr as Y
import Data.Yarr.IO.Image as Y
...

main = do
    [target] <- getArgs
    image <- dComputeS $ fromFunction (512, 512) (return . pixelRenderer)
    Y.writeImage target (Grey image)
  where
    pixelRenderer, pixelRenderer' :: Dim2 -> Word8
    pixelRenderer (y, x)
        = floor $ ((perlin3 permutation ((fromIntegral x - 256) / 32,
          (fromIntegral y - 256) / 32, 0 :: Double))+1)/2 * 128

    -- This code is much more readable, but also much slower.
    pixelRenderer' (y, x)
        = (\w -> floor $ ((w+1)/2 * 128)) -- w should be in [-1,+1]
        . perlin3 permutation
        . (\(x,y,z) -> ((x-256)/32, (y-256)/32, (z-256)/32))
        $ (fromIntegral x, fromIntegral y, 0 :: Double)

This makes the program about 30% faster: it now runs in 0.13 seconds.

Secondly, I replaced the uses of the standard floor with

doubleToByte :: Double -> Word8
doubleToByte f = fromIntegral (truncate f :: Int)

This is a known issue (google "haskell floor performance"). Execution time drops to 52 ms (0.052 seconds), about 2.5 times faster than the previous 0.13 seconds.
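
One caveat with this replacement: truncate rounds toward zero while floor rounds toward negative infinity, so the two only agree for non-negative inputs. The sketch below checks that they match on the range the renderer should produce, assuming the scaled noise value stays within [0, 128]; viaFloor is a hypothetical name for the original floor-based conversion.

```haskell
module Main where

import Data.Word (Word8)

-- The replacement from above: truncate to Int, then narrow to Word8.
doubleToByte :: Double -> Word8
doubleToByte f = fromIntegral (truncate f :: Int)

-- The original conversion, for comparison.
viaFloor :: Double -> Word8
viaFloor = floor

main :: IO ()
main = do
  print (map doubleToByte [0, 0.9, 64.5, 127.99])  -- prints [0,0,64,127]
  -- Both conversions agree on non-negative inputs in the expected range.
  print (all (\x -> doubleToByte x == viaFloor x) [0, 0.25 .. 128])
```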

Finally, just for fun, I tried computing the noise in parallel (dComputeP instead of dComputeS, and +RTS -N4 on the command line). The program took 36 ms, including a constant I/O cost of about 10 ms.