What's the ideal implementation for the Sieve of Eratosthenes between Lists, Arrays, and Mutable Arrays?

Question 1

This

primes = sieve [2..] where
         sieve (p:x) = p : sieve [ n | n <- x, n `mod` p > 0 ]

is not a sieve. It's very inefficient trial division. Don't use that!

I'm curious about how you got your times, there is no way that the Turner "sieve" could produce the primes not exceeding 2,000,000 in mere seconds. Letting it find the primes to 200,000 took

MUT     time    6.38s  (  6.39s elapsed)
GC      time    9.19s  (  9.20s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time   15.57s  ( 15.59s elapsed)

on my box (64-bit Linux, ghc-7.6.1, compiled with -O2). The complexity of that algorithm is O(N² / log² N), almost quadratic. Letting it proceed to 2,000,000 would take about twenty minutes.

Your times for the array versions are suspicious too, though in the other direction. Did you measure interpreted code?

Sieving to 2,000,000, compiled with optimisations, the mutable array code took 0.35 seconds to run, and the immutable array code 0.12 seconds.

Now, that still has the mutable array about three times slower than the immutable array.

But, it's an unfair comparison. For the immutable array, you used Ints, and for the mutable array Integers. Changing the mutable array code to use Ints - as it should, since under the hood, arrays are Int-indexed, so using Integer is an unnecessary performance sacrifice that buys nothing - made the mutable array code run in 0.15 seconds. Close to the mutable array code, but not quite there. However, you let the mutable array do more work, since in the immutable array code you only eliminate odd multiples of the odd primes, but in the mutable array code, you mark all multiples of all primes. Changing the mutable array code to treat 2 specially, and only eliminate odd multiples of odd primes brings that down to 0.12 seconds too.

But, you're using range-checked array indexing, which is slow, and, since the validity of the indices is checked in the code itself, unnecessary. Changing that to using unsafeRead and unsafeWrite brings down the time for the immutable array to 0.09 seconds.

Then you have the problem that using

forM_ [x, y .. z]

uses boxed Ints (fortunately, GHC eliminates the list). Writing a loop yourself, so that only unboxed Int#s are used, the time goes down to 0.02 seconds.

{-# LANGUAGE MonoLocalBinds #-}
import Control.Monad (forM_, when)
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base

primeSieve :: Int -> UArray Int Bool
primeSieve top = runSTUArray $ do
    a <- newArray (0,top) True
    unsafeWrite a 0 False
    unsafeWrite a 1 False
    let r = ceiling . sqrt $ fromIntegral top
        mark step idx
            | top < idx = return ()
            | otherwise = do
                unsafeWrite a idx False
                mark step (idx+step)
        sift p
            | r < p     = return a
            | otherwise = do
                prim <- unsafeRead a p
                when prim $ mark (2*p) (p*p)
                sift (p+2)
    mark 2 4
    sift 3

-- Return primes from sieve as list:
primesTo :: Int -> [Int]
primesTo top = [p | (p,True) <- assocs $ primeSieve top]

main :: IO ()
main = print .last $ primesTo 2000000

So, wrapping up, for a Sieve of Eratosthenes, you should use an array - not surprising, its efficiency depends on being able to step from one multiple to the next in short constant time.

You get very simple and straightforward code with immutable arrays, and that code performs decently for not too high limits (it gets relatively worse for higher limits, since the arrays are still copied and garbage-collected, but that's not too bad).

When you need better performance, you need mutable arrays. Writing efficient mutable array code is not entirely trivial, one has to know how the compiler translates the various constructs to choose the right one, and some would consider such code unidiomatic. But you can also use a library (disclaimer: I wrote it) that provides a fairly efficient implementation rather than writing it yourself.

Question 2

Mutable array will always be the winner in terms of performance (and you really should've copied the version that works on odds only as a minimum; it should be the fastest of the three - also because it uses Int and not Integer).

For lists, tree-shaped merging incremental sieve should perform better than the one you show. You can always use it with takeWhile (< limit) if needed. I contend that it conveys the true nature of the sieve most clearly:

import Data.List (unfoldr)

primes :: [Int]         
primes = 2 : _Y ((3 :) . gaps 5 . _U . map (\p -> [p*p, p*p+2*p..]))

_Y g = g (_Y g)                                -- recursion 
_U ((x:xs):t) = (x :) . union xs . _U          -- ~= nub . sort . concat
                      . unfoldr (\(a:b:c) -> Just (union a b, c)) $ t

gaps k s@(x:xs) | k < x     = k : gaps (k+2) s   -- ~= [k,k+2..]\\s, when 
                | otherwise =     gaps (k+2) xs  --  k<=x && null(s\\[k,k+2..])

union a@(x:xs) b@(y:ys) = case compare x y of  -- ~= nub . sort .: (++)
         LT -> x : union  xs  b
         EQ -> x : union  xs ys
         GT -> y : union  a  ys

_U reimplements Data.List.Ordered.unionAll, and gaps 5 is (minus [5,7..]), fused for efficiency, with minus and union from the same package.

Of course nothing beats the brevity of Data.List.nubBy (((>1).).gcd) [2..] (but it's very slow).

To your 1st new question: not. It does find the multiples by counting up, as any true sieve should _{(although "minus" on lists is of course under-performant; the above improves on that by re-arranging a linear subtraction chain ((((xs-a)-b)-c)- ... ) into a subtraction of tree-folded additions, xs-(a+((b+c)+...))).}

Question 3

As has been said, using mutable arrays has the best performance. The following code is derived from this "TemplateHaskell" version converted back to something more in line with the straight mutable Array solution as the "TemplateHaskell" doesn't seem to make any difference, with some further optimizations. I believe it is faster than the usual mutable unboxed array versions due to the further optimizations and especially due to the use of the "unsafeRead" and "unsafeWrite" functions that avoid range checking of the arrays, probably also internally using pointers for array access:

{-# OPTIONS -O2 -optc-O3 #-}
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base

primesToUA :: Word32-> [Word32]
primesToUA top = do
    let sieveUA top = runSTUArray $ do    
        let m = ((fromIntegral top) - 3) `div` 2 :: Int
        buf <- newArray (0,m) True -- :: ST s (STUArray s Int Bool)
        let cullUA i = do
            let p = i + i + 3
                strt = p * (i + 1) + i
            let cull j = do
                if j > m then cullUA (i + 1)
                else do
                    unsafeWrite buf j False
                    cull (j + p)
            if strt > m then return ()
            else do
                e <- unsafeRead buf i
                if e then cull strt else cullUA (i + 1)
        cullUA 0; return buf
    if top > 1 then 2 : [2 * (fromIntegral i) + 3 | (i,True) <- assocs $ sieveUA top]
    else []

main = do 
    x <- read `fmap` getLine      --   1mln    2mln    10mln   100mln
    print (last (primesToUA x))   --   0.01    0.02     0.09     1.26  seconds

EDIT: The above code has been corrected and the times below edited to reflect the correction and as well the comments comparing the paged to the non-paged version have been edited.

The times to run this to the indicated top ranges are as shown in the comment table at the bottom of the above source code as measured by ideone.com and are about exactly five times faster than the answer posted by @WillNess as also measured at ideone.com. This code takes a trivial amount of time to cull the primes to two million and only 1.26 seconds to cull to a hundred million. These times are about 2.86 times faster when run on my i7 (3.5 GHz) at 0.44 seconds to a hundred million, and takes 6.81 seconds to run to one billion. The memory use is just over six megabytes for the former and sixty megabytes for the latter, which is the memory used by the huge (bit packed) array. This array also explains the non-linear performance in that as the array size exceeds the CPU cache sizes the average memory access times get worse per composite number representation cull.

EDIT_ADD: A page segmented sieve is more efficient in that it has better memory access efficiency when the buffer size is kept smaller than the L1 or L2 CPU caches, and as well has the advantage that it is unbounded so that the upper range does not have to be specified in advance and a much smaller memory footprint being just the base primes less than the square root of the range used plus the page buffer memory. The following code has been written as a page segmented implementation and is somewhat faster than the non-paged version; it also offers the advantage that one can change the output range specification at the top of the code to 'Word64' from 'Word32' so it is then not limited to the 32-bit number range, at only a slight cost in processing time (for 32-bit compiled code) for any range that is in common. The code is as follows:

-- from http://www.haskell.org/haskellwiki/Prime_numbers#Using_ST_Array

{-# OPTIONS -O2 -optc-O3 #-}
import Data.Word
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base

primesUA :: () -> [Word32]
primesUA () = do
    let pgSZBTS = 262144 * 8
    let sieveUA low bps = runSTUArray $ do
        let nxt = (fromIntegral low) + (fromIntegral pgSZBTS)
        buf <- newArray (0,pgSZBTS - 1) True -- :: ST s (STUArray s Int Bool)
        let cullUAbase i = do
            let p = i + i + 3
                strt = p * (i + 1) + i
            when (strt < pgSZBTS) $ do
                e <- unsafeRead buf i
                if e then do
                    let loop j = do
                        if j < pgSZBTS then do
                            unsafeWrite buf j False
                            loop (j + p)
                        else cullUAbase (i + 1)
                    loop strt
                else cullUAbase (i + 1)
        let cullUA ~(p:t) = do
            let bp = (fromIntegral p)
                i = (bp - 3) `div` 2
                s = bp * (i + 1) + i
            when (s < nxt) $ do
                let strt = do
                    if s >= low then fromIntegral (s - low)
                    else  do
                        let b = (low - s) `rem` bp
                        if b == 0 then 0 else fromIntegral (bp - b)
                let loop j = do
                    if j < pgSZBTS then do
                        unsafeWrite buf j False
                        loop (j + (fromIntegral p))
                    else cullUA t
                loop strt
        if low <= 0 then cullUAbase 0 else cullUA bps
        return buf
    let sieveList low bps = do
        [2 * ((fromIntegral i) + low) + 3 | (i,True) <- assocs $ sieveUA low bps]
    let sieve low bps = do
        (sieveList low bps) ++ sieve (low + (fromIntegral pgSZBTS)) bps
    let primes' = ((sieveList 0 []) ++ sieve (fromIntegral pgSZBTS) primes') :: [Word32]
    2 : sieve 0 primes'

main = do 
   x <- read `fmap` getLine      --   1mln    2mln    10mln   100mln
                                 --   0.02    0.03     0.13     1.13  seconds
   print (length (takeWhile ((>=) (fromIntegral x)) (primesUA ())))

The above code has quite a few more lines of code than the non-paged case due to the need to cull the composite number representations from the first page array differently than succeeding pages. The code also has the fixes so that there aren't memory leaks due to the base primes list and the output list now not being the same list (thus avoiding holding onto the whole list in memory).

Note that this code takes close to linear time (over the range sieved) as the range gets larger due to the culling buffer being of a constant size less than the L2 CPU cache. Memory use is a fraction of that used by the non-paged version at just under 600 kilobytes for a hundred million and just over 600 kilobytes for one billion, which slight increase is just the extra space required for the base primes less than the square root of the range list.

On ideone.com this code produces the number of primes to a hundred million in about 1.13 seconds and about 12 seconds to one billion (32-bit setting). Probably wheel factorization and definitely multi-core processing would make it even faster on a multi-core CPU. On my i7 (3.5 GHz), it takes 0.44 seconds to sieve to a hundred million and 4.7 seconds to one billion, with the roughly linear performance with increasing range as expected. It seems that there is some sort of non-linear overhead in the version of GHC run on ideone.com that has some performance penalty for larger ranges that is not present for the i7 and that is perhaps related to different garbage collection, as the page buffers are being created new for each new page. END_EDIT_ADD

EDIT_ADD2: It seems that much of the processing time for the above page segmented code is used in (lazy) list processing, so the code is accordingly reformulated with several improvements as follows:

Implemented a prime counting function that does not use list processing and uses "popCount" table look ups to count the number of 'one' bits in a 32 bit word at a time. In this way, the time to find the results is insignificant compared to the actual sieve culling time.
Stored the base primes as a list of bit packed page segments, which is much more space efficient than storing a list of primes, and the the time to convert the page segments to primes as required is not much of a computational overhead.
Tuned the prime segment make function so that for the initial zero page segment, it uses its own bit pattern as a source page, thus making the composite number cull code shorter and simpler.

The code then becomes as follows:

{-# OPTIONS -O3 -rtsopts #-} -- -fllvm ide.com doesn't support LLVM

import Data.Word
import Data.Bits
import Control.Monad
import Control.Monad.ST
import Data.Array.ST (runSTUArray)
import Data.Array.Unboxed
import Data.Array.Base

pgSZBTS = (2^18) * 8 :: Int -- size of L2 data cache

type PrimeType = Word32
type Chunk = UArray PrimeType Bool

-- makes a new page chunk and culls it
--   if the base primes list provided is empty then
--   it uses the current array as source (for zero page base primes)
mkChnk :: Word32 -> [Chunk] -> Chunk
mkChnk low bschnks = runSTUArray $ do
  let nxt = (fromIntegral low) + (fromIntegral pgSZBTS)
  buf <- nxt `seq` newArray (fromIntegral low, fromIntegral nxt - 1) True
  let cull ~(p:ps) =
        let bp = (fromIntegral p)
            i = (bp - 3) `shiftR` 1
            s = bp * (i + 1) + i in
        let cullp j = do
              if j >= pgSZBTS then cull ps
              else do
                unsafeWrite buf j False
                cullp (j + (fromIntegral p)) in
        when (s < nxt) $ do
          let strt = do
                if s >= low then fromIntegral (s - low)
                else  do
                  let b = (low - s) `rem` bp
                  if b == 0 then 0 else fromIntegral (bp - b)
          cullp strt
  case bschnks of
    [] -> do bsbf <- unsafeFreezeSTUArray buf
             cull (listChnkPrms [bsbf])
    _ -> cull $ listChnkPrms bschnks
  return buf

-- creates a page chunk list starting at the lw value
chnksList :: Word32 -> [Chunk]
chnksList lw =
  mkChnk lw basePrmChnks : chnksList (lw + fromIntegral pgSZBTS)

-- converts a page chunk list to a list of primes
listChnkPrms :: [Chunk] -> [PrimeType]
listChnkPrms [] = []
listChnkPrms ~(hdchnk@(UArray lw _ rng _):tlchnks) =
  let nxtp i =
        if i >= rng then [] else
          if unsafeAt hdchnk i then
            (case ((lw + fromIntegral i) `shiftL` 1) + 3 of
              np -> np) : nxtp (i + 1)
          else nxtp (i + 1) in
  (hdchnk `seq` lw `seq` nxtp 0) ++ listChnkPrms tlchnks

-- the base page chunk list used to cull the higher order pages,
--   note that it has special treatment for the zero page.
--   It is more space efficient to store this as chunks rather than
--   as a list of primes or even a list of deltas (gaps), with the
--   extra processing to convert as needed not too much.
basePrmChnks :: [Chunk]
basePrmChnks = mkChnk 0 [] : chnksList (fromIntegral pgSZBTS)

-- the full list of primes could be accessed with the following function.
primes :: () -> [PrimeType]
primes () = 2 : (listChnkPrms $ chnksList 0)

-- a quite fast prime counting up to the given limit using
--   chunk processing to avoid innermost list processing loops.
countPrimesTo :: PrimeType -> Int
countPrimesTo limit =
  let lmtb = (limit - 3) `div` 2 in
  let sumChnks acc chnks@(chnk@(UArray lo hi rng _):chnks') =
        let cnt :: UArray PrimeType Word32 -> Int
            cnt bfw =
              case if lmtb < hi then fromIntegral (lmtb - lo) else rng of
                crng -> case crng `shiftR` 5 of
                  rngw -> 
                    let cnt' i ac =
                          ac `seq` if i >= rngw then
                            if (i `shiftL` 5) >= rng then ac else
                              case (-2) `shiftL` fromIntegral (lmtb .&. 31) of
                                msk -> msk `seq`
                                  case (unsafeAt bfw rngw) .&.
                                       (complement msk) of
                                    bts -> bts `seq` case popCount bts of
                                      c -> c `seq` case ac + c of nac -> nac
                          else case ac + (popCount $ unsafeAt bfw i) of
                                 nacc -> nacc `seq` cnt' (i + 1) (nacc)
                    in cnt' 0 0 in
        acc `seq` case runST $ do -- make UArray _ Bool into a UArray _ Word32
          stbuf <- unsafeThawSTUArray chnk
          stbufw <- castSTUArray stbuf
          bufw <- unsafeFreezeSTUArray stbufw
          return $ cnt bufw of
            c -> c `seq` case acc + c of
              nacc -> nacc `seq` if hi >= lmtb then nacc
                      else sumChnks nacc chnks' in
  if limit < 2 then 0 else if limit < 3 then 1 else
    lmtb `seq` sumChnks 1 (chnksList 0)

main = do 
  x <- read `fmap` getLine  --  1mln   2mln  10mln  100mln 1000mln
                            --  0.02   0.03   0.06    0.45    4.60   seconds
                            --  7328   7328   8352    8352    9424  Kilobytes
  -- this takes 14.34 seconds and 9424 Kilobytes to 3 billion on ideone.com,
  -- and 9.12 seconds for 3 billion on an i7-2700K (3.5 GHz).
  -- The above ratio of about 1.6 is much better than usual due to
  -- the extremely low memory use of the page segmented algorithm.
  -- It seems thaat the Windows Native Code Generator (NCG) from GHC
  --   is particularly poor, as the Linux 32-bit version takes
  --   less than two thirds of the time for exactly the same program...
  print $ countPrimesTo x
--  print $ length $ takeWhile ((>=) x) $ primes () -- the slow way to do this

The times and memory requirements given in the code are as observed when run on ideone.com, with 0.02, 0.03, 0.05, 0.30, 3.0, and 9.1 seconds required to run on my i7-2700K (3.5 GHz) for one, two, ten, a hundred, a thousand (one billion), and three thousand (three billion) million range, respectively, with a pretty much constant memory footprint slowly increasing with the number of base primes less than the square root of the range as required. When compiled with the LLVM compiler back end, these times become 0.01, 0.02, 0.02, 0.12, 1.35, and 4.15 seconds, respectively, due to its more efficient use of registers and machine instructions; this last is quite close to the same speed as if compiled with a 64-bit compiler rather than the 32-bit compiler used as the efficient use of registers means that availability of extra registers doesn't make much difference.

As commented in the code, the ratio between performance on my real machine and the ideone.com servers becomes much less than for much more memory wasteful algorithms due to not being throttled by memory access bottlenecks so the limit to speed is mostly just the ratio of CPU clock speeds as well as CPU processing efficiency per clock cycle. However, as commented there, there is some strange inefficiency with the GHC Native Code Generator (NCG) when run under Windows (32-bit compiler) in that the run times are over 50% slower than if run under Linux (as the ideone.com server uses). AFAIK they both have a common code base for the same Haskell GHC version and the only divergence is in the linker used (which is also used with the LLVM backend, which is not affected) as GHC NCG does not use GCC but only the mingw32 assembler, which should also be the same.

Note that this code when compiled with the LLVM compiler back end is about the same speed as the same algorithm written for highly optimized 'C/C++' implementations indicating that Haskell really has the ability to develop very tight loop coding. It might be said that the Haskell code is quite a bit more readable and secure than equivalent 'C/C++' code once one gets used to the Haskell paradigms of monadic and non-strict code. The further refinements in execution speed for the Sieve of Eratosthenes are purely a function of the tuning of the implementations used and not the choice of language between Haskell and 'C/C++'.

Summary: Of course, this isn't yet the ultimate in speed for a Haskell version of the Sieve of Eratosthenes in that we still haven't tuned the memory access to more efficiently use the fast CPU L1 cache, nor have we significantly reduced the total number of composite culling operations necessary using extreme wheel factorization other than to eliminate the odds processing. However, this is enough to answer the question in showing that mutable arrays are the most efficient way of addressing such tight loop type of problems, with potential speed gains of about 100 times over the use of lists or immutable arrays. END_EDIT_ADD2