Examples of a monad whose Applicative part can be better optimized than the Monad part

Question 1

Another example is a strict left fold. You can write an applicative instance which allows you to compose folds so that the resulting fold can be performed on the data in a single pass and constant space. However, the monad instance needs to re-iterate from the beginning of the data for each bind and keep the whole list in memory.

{-# LANGUAGE GADTs #-}

import Criterion.Main

import Data.Monoid
import Control.Applicative
import Control.Monad

import Prelude hiding (sum)

data Fold e r where
    Step :: !(a -> e -> a) -> !a -> !(a -> r) -> Fold e r
    Bind :: !(Fold e r) -> !(r -> Fold e s) -> Fold e s

data P a b = P !a !b

instance Functor (Fold e) where
    fmap f (Step step acc ret) = Step step acc (f . ret)
    fmap f (Bind fld g) = Bind fld (fmap f . g)

instance Applicative (Fold e) where
    pure a    = Step const a id
    Step fstep facc fret <*> Step xstep xacc xret = Step step acc ret where
        step (P fa xa) e = P (fstep fa e) (xstep xa e)
        acc = P facc xacc
        ret (P fa xa) = (fret fa) (xret xa)

    Bind fld g <*> fldx = Bind fld ((<*> fldx) . g)
    fldf <*> Bind fld g = Bind fld ((fldf <*>) . g)

instance Monad (Fold e) where
    return = pure
    (>>=) = Bind

fold :: Fold e r -> [e] -> r
fold (Step _ acc ret) [] = ret acc
fold (Step step acc ret) (x:xs) = fold (Step step (step acc x) ret) xs
fold (Bind fld g) lst = fold (g $ fold fld lst) lst

monoidalFold :: Monoid m => (e -> m) -> (m -> r) -> Fold e r
monoidalFold f g = Step (\a -> mappend a . f) mempty g

count :: Num n => Fold e n
count = monoidalFold (const (Sum 1)) getSum

sum :: Num n => Fold n n
sum = monoidalFold Sum getSum

avgA :: Fold Double Double
avgA = liftA2 (/) sum count

avgM :: Fold Double Double
avgM = liftM2 (/) sum count

main :: IO ()
main = defaultMain
    [ bench "Monadic"     $ nf (test avgM) 1000000
    , bench "Applicative" $ nf (test avgA) 1000000
    ] where test f n = fold f [1..n]

I wrote the above from the top of my head as an example so it might not be the optimal implementation for applicative and monadic folds, but running the above gives me:

benchmarking Monadic
mean: 119.3114 ms, lb 118.8383 ms, ub 120.2822 ms, ci 0.950
std dev: 3.339376 ms, lb 2.012613 ms, ub 6.215090 ms, ci 0.950

benchmarking Applicative
mean: 51.95634 ms, lb 51.81261 ms, ub 52.15113 ms, ci 0.950
std dev: 850.1623 us, lb 667.6838 us, ub 1.127035 ms, ci 0.950

Question 2

Perhaps the canonical example is given by the vectors.

data Nat = Z | S Nat deriving (Show, Eq, Ord)

data Vec :: Nat -> * -> * where
  V0    ::                  Vec Z x
  (:>)  :: x -> Vec n x ->  Vec (S n) x

We can make them applicative with a little effort, first defining singletons, then wrapping them in a class.

data Natty :: Nat -> * where
  Zy  :: Natty Z
  Sy  :: Natty n -> Natty (S n)

class NATTY (n :: Nat) where
  natty :: Natty n

instance NATTY Z where
  natty = Zy

instance NATTY n => NATTY (S n) where
  natty = Sy natty

Now we may develop the Applicative structure

instance NATTY n => Applicative (Vec n) where
  pure   = vcopies natty
  (<*>)  = vapp

vcopies :: forall n x. Natty n -> x -> Vec n x
vcopies  Zy      x  =  V0
vcopies  (Sy n)  x  =  x :> vcopies n x   

vapp :: forall n s t. Vec n (s -> t) -> Vec n s -> Vec n t
vapp  V0         V0         = V0
vapp  (f :> fs)  (s :> ss)  = f s :> vapp fs ss

I omit the Functor instance (which should be extracted via fmapDefault from the Traversable instance).

Now, there is a Monad instance corresponding to this Applicative, but what is it? Diagonal thinking! That's what's required! A vector can be seen as the tabulation of a function from a finite domain, hence the Applicative is just a tabulation of the K- and S-combinators, and the Monad has a Reader-like behaviour.

vtail :: forall n x. Vec (S n) x -> Vec n x
vtail (x :> xs) = xs

vjoin :: forall n x. Natty n -> Vec n (Vec n x) -> Vec n x
vjoin Zy     _                  = V0
vjoin (Sy n) ((x :> _) :> xxss) = x :> vjoin n (fmap vtail xxss)

instance NATTY n => Monad (Vec n) where
  return    = vcopies natty
  xs >>= f  = vjoin natty (fmap f xs)

You might save a bit by defining >>= more directly, but any way you cut it, the monadic behaviour creates useless thunks for off-diagonal computations. Laziness might save us from slowing down by an armageddon factor, but the zipping behaviour of the <*> is bound to be at least a little cheaper than taking the diagonal of a matrix.

Question 3

As pigworker said, arrays are the obvious example; their monad instance is not just a bit more problematic on the conceptual level with type-indexed lengths etc., but also performs worse in the very much real-worldly Data.Vector implementation:

import Criterion.Main
import Data.Vector as V

import Control.Monad
import Control.Applicative

functions :: V.Vector (Int -> Int)
functions = V.fromList [(+1), (*2), (subtract 1), \x -> x*x]

values :: V.Vector Int
values = V.enumFromN 1 32

type NRuns = Int

apBencher :: (V.Vector (Int -> Int) -> V.Vector Int -> V.Vector Int)
           -> NRuns -> Int
apBencher ap' = run values
 where run arr 0 = V.sum arr 
       run arr n = run (functions `ap'` arr) $ n-1

main = defaultMain
        [ bench "Monadic"     $ nf (apBencher ap   ) 4
        , bench "Applicative" $ nf (apBencher (<*>)) 4 ]

$ ghc-7.6 -O1 -o -fllvm -o bin/bench-d0 def0.hs
$ bench-d0
warming up
estimating clock resolution...
mean is 1.516271 us (640001 iterations)
found 3768 outliers among 639999 samples (0.6%)
2924 (0.5%) high severe
estimating cost of a clock call...
mean is 41.62906 ns (12 iterations)
found 1 outliers among 12 samples (8.3%)
1 (8.3%) high severe

benchmarking Monadic
mean: 2.773062 ms, lb 2.769786 ms, ub 2.779151 ms, ci 0.950
std dev: 22.14540 us, lb 13.55686 us, ub 36.88265 us, ci 0.950

benchmarking Applicative
mean: 1.269351 ms, lb 1.267654 ms, ub 1.271526 ms, ci 0.950
std dev: 9.799454 us, lb 8.171284 us, ub 13.09267 us, ci 0.950

Note that it doesn't come out with the performance difference when you compile with -O2; apparently ap is replaced by <*> then. But >>= can only allocate the right amount of memory after each function call and then put the results in place, which appears to be quite time-expensive; whereas <*> can simply precompute the result length as the product of functions and values lengths, and then write to one fixed array.