Why is my code using monadic lists from the List package so slow?
-
29-09-2019 - |
Question
Last week user Masse asked a question about recursively listing files in a directory in Haskell. My first thought was to try using monadic lists from the List
package to avoid building the entire list in memory before the printing can start. I implemented this as follows:
module Main where
import Prelude hiding (filter)
import Control.Applicative ((<$>))
import Control.Monad (join)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.ListT (ListT)
import Data.List.Class (cons, execute, filter, fromList, mapL)
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))
main = execute . mapL putStrLn . listFiles =<< head <$> getArgs
listFiles :: FilePath -> ListT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
where
valid "." = False
valid ".." = False
valid _ = True
listIfDir False = return path
listIfDir True
= cons path
$ join
$ listFiles
<$> (path </>)
<$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))
This works beautifully in that it starts printing immediately and uses very little memory. Unfortunately it's also dozens of times slower than a comparable FilePath -> IO [FilePath]
version.
What am I doing wrong? I've never used the List
package's ListT
outside of toy examples like this, so I don't know what kind of performance to expect, but 30 seconds (vs. a fraction of a second) to process a directory with ~40,000 files seems much too slow.
Solution
Profiling shows that join
(together with doesDirectoryExists
) accounts for most of the time in your code. Lets see how its definition unfolds:
join x
=> (definition of join in Control.Monad)
x >>= id
=> (definition of >>= in Control.Monad.ListT)
foldrL' mappend mempty (fmap id x)
=> (fmap id = id)
foldrL' mappend mempty x
If in the root directory of the search there are k
subdirectories and their contents are already computed in the lists: d1, d2, ... dk
, then after applying join
you'll get (roughly): (...(([] ++ d1) ++ d2) ... ++ dk)
. Since x ++ y
takes time O(length x)
the whole thing will take time O(d1 + (d1 + d2) + ... + (d1 + ... dk-1))
. If we assume that the number of files is n
and they are evenly distributed between d1 ... dk
then the time to compute join
would be O(n*k)
and that is only for the first level of listFiles
.
This, I think, is the main performance problem with your solution.
OTHER TIPS
I'm curious, how well does the same program written to use logict work for you? LogicT
is semantically the same as ListT
, but implemented in continuation-passing style so that it shouldn't have the concat
-related type of problems you seem to be running into.
import Prelude hiding (filter)
import Control.Applicative
import Control.Monad
import Control.Monad.Logic
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))
main = sequence_ =<< observeAllT . fmap putStrLn . listFiles =<< head <$> getArgs
cons :: MonadPlus m => a -> m a -> m a
cons x xs = return x `mplus` xs
fromList :: MonadPlus m => [a] -> m a
fromList = foldr cons mzero
filter :: MonadPlus m => (a -> Bool) -> m a -> m a
filter f xs = do
x <- xs
guard $ f x
return x
listFiles :: FilePath -> LogicT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
where
valid "." = False
valid ".." = False
valid _ = True
listIfDir False = return path
listIfDir True
= cons path
$ join
$ listFiles
<$> (path </>)
<$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))
Running it on a large directory reveals a memory leak. I suspect this has to do with the strictness of getDirectoryContents, but there might be more going on. Simple profiling didn't turn up much, I'd add some extra cost centers and go from there...