Вопрос

I'm writing a program that creates a shell script containing one command for each image file in a directory. There are 667,944 images in the directory, so I need to handle the strictness/laziness issue properly.

Here's a simple example that gives me Stack space overflow. It does work if I give it more space using +RTS -Ksize -RTS, but it should be able run with little memory, producing output immediately. So I've been reading the stuff about strictness in the Haskell wiki and the wikibook on Haskell, trying to figure out how to fix the problem, and I think it's one of the mapM commands that is giving me grief, but I still don't understand enough about strictness to sort the problem.

I've found some other questions on SO that seem relevant (Is mapM in Haskell strict? Why does this program get a stack overflow? and Is Haskell's mapM not lazy?), but enlightenment still eludes me.

import System.Environment (getArgs)
import System.Directory (getDirectoryContents)

genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
  let infile = indir ++ '/':file
  let angle = 0 -- have to actually read the file to calculate this for real
  let outfile = outdir ++ '/':file
  return $! "convert " ++ infile ++ " -rotate " ++ show angle ++ 
    " -crop 143x143+140+140 " ++ outfile

main :: IO ()
main = do
  putStrLn "#!/bin/sh"
  (indir:outdir:_) <- getArgs
  files <- getDirectoryContents indir
  let imageFiles = filter (`notElem` [".", ".."]) files
  commands <- mapM (genCommand indir outdir) imageFiles
  mapM_ putStrLn commands

EDIT: TEST #1

Here's the newest version of the example.

import System.Environment (getArgs)
import System.Directory (getDirectoryContents)
import Control.Monad ((>=>))

genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
  let infile = indir ++ '/':file
  let angle = 0 -- have to actually read the file to calculate this for real
  let outfile = outdir ++ '/':file
  return $! "convert " ++ infile ++ " -rotate " ++ show angle ++ 
    " -crop 143x143+140+140 " ++ outfile

main :: IO ()
main = do
  putStrLn "TEST 1"
  (indir:outdir:_) <- getArgs
  files <- getDirectoryContents indir
  putStrLn $ show (length files)
  let imageFiles = filter (`notElem` [".", ".."]) files
  -- mapM_ (genCommand indir outdir >=> putStrLn) imageFiles
  mapM_ (\filename -> genCommand indir outdir filename >>= putStrLn) imageFiles

I compile it with the command ghc --make -O2 amy2.hs -rtsopts. If I run it with the command ./amy2 ~/nosync/GalaxyZoo/table2/images/ wombat, I get

TEST 1
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

If I instead run it with the command ./amy2 ~/nosync/GalaxyZoo/table2/images/ wombat +RTS -K20M, I get the correct output...eventually:

TEST 1
667946
convert /home/amy/nosync/GalaxyZoo/table2/images//587736546846572812.jpeg -rotate 0 -crop 143x143+140+140 wombat/587736546846572812.jpeg
convert /home/amy/nosync/GalaxyZoo/table2/images//587736542558617814.jpeg -rotate 0 -crop 143x143+140+140 wombat/587736542558617814.jpeg

...and so on.

Это было полезно?

Решение

This isn't really a strictness issue(*), but an order of evaluation issue. Unlike lazily evaluated pure values, monadic effects must happen in deterministic order. mapM executes every action in the given list and gathers the results, but it cannot return until the whole list of actions is executed, so you don't get the same streaming behavior as with pure list functions.

The easy fix in this case is to run both genCommand and putStrLn inside the same mapM_. Note that mapM_ doesn't suffer from the same issue since it is not building an intermediate list.

mapM_ (genCommand indir outdir >=> putStrLn) imageFiles

The above uses the "kleisli composition operator" >=> from Control.Monad which is like the function composition operator . except for monadic functions. You can also use the normal bind and a lambda.

mapM_ (\filename -> genCommand indir outdir filename >>= putStrLn) imageFiles

For more complex I/O applications where you want better composability between small, monadic stream processors, you should use a library such as conduit or pipes.

Also, make sure you are compiling with either -O or -O2.

(*) To be exact, it is also a strictness issue, because in addition to building a large, intermediate list in memory, laziness causes mapM to build unnecessary thunks and use up stack.

EDIT: So it seems the main culprit might be getDirectoryContents. Looking at the function's source code, it essentially does the same kind of list accumulation internally as mapM.

In order to do streaming directory listing, we need to use System.Posix.Directory which unfortunately makes the program incompatible with non-POSIX systems (like Windows). You can stream the directory contents by e.g. using continuation passing style

import System.Environment (getArgs)
import Control.Monad ((>=>))

import System.Posix.Directory (openDirStream, readDirStream, closeDirStream)
import Control.Exception (bracket)

genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
  let infile = indir ++ '/':file
  let angle = 0 -- have to actually read the file to calculate this for real
  let outfile = outdir ++ '/':file
  return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
    " -crop 143x143+140+140 " ++ outfile

streamingDirContents :: FilePath -> (FilePath -> IO ()) -> IO ()
streamingDirContents root cont = do
    let loop stream = do
            fp <- readDirStream stream
            case fp of
                [] -> return ()
                _   | fp `notElem` [".", ".."] -> cont fp >> loop stream
                    | otherwise -> loop stream
    bracket (openDirStream root) loop closeDirStream


main :: IO ()
main = do
  putStrLn "TEST 1"
  (indir:outdir:_) <- getArgs
  streamingDirContents indir (genCommand indir outdir >=> putStrLn)

Here's how you could do the same thing using conduit:

import System.Environment (getArgs)

import System.Posix.Directory (openDirStream, readDirStream, closeDirStream)

import Data.Conduit
import qualified  Data.Conduit.List as L
import Control.Monad.IO.Class (liftIO, MonadIO)

genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
  let infile = indir ++ '/':file
  let angle = 0 -- have to actually read the file to calculate this for real
  let outfile = outdir ++ '/':file
  return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
    " -crop 143x143+140+140 " ++ outfile

dirSource :: (MonadResource m, MonadIO m) => FilePath -> Source m FilePath
dirSource root = do
    bracketP (openDirStream root) closeDirStream $ \stream -> do
        let loop = do
                fp <- liftIO $ readDirStream stream
                case fp of
                    [] -> return ()
                    _  -> yield fp >> loop
        loop

main :: IO ()
main = do
    putStrLn "TEST 1"
    (indir:outdir:_) <- getArgs
    let files    = dirSource indir $= L.filter (`notElem` [".", ".."])
        commands = files $= L.mapM (liftIO . genCommand indir outdir)

    runResourceT $ commands $$ L.mapM_ (liftIO . putStrLn)

The nice thing about conduit is that you regain the ability to compose pieces of functionality with things like conduit versions of filter and mapM. The $= operator streams stuff forward in the chain and $$ connects the stream to a consumer.

The not-so-nice thing is that real world is complicated and writing efficient and robust code requires us to jump through some hoops with resource management. That's why all the operations work in the ResourceT monad transformer which keeps track of e.g. open file handles and cleans them up promptly and deterministically when they are no longer needed or e.g. if the computation gets aborted by an exception (this is in contrast to using lazy I/O and relying on the garbage collector to eventually release any scarce resources).

However, this means that we a) need to run the final resulting conduit operation with runResourceT and b) we need to explicitly lift I/O operations to the transformed monad using liftIO instead of being able to directly write e.g. L.mapM_ putStrLn.

Лицензировано под: CC-BY-SA с атрибуция
Не связан с StackOverflow
scroll top