Question

I'm using Data.Text.Lazy to process some text files. I read in 2 files and distribute their text to 3 files according to some criteria. The loop which does the processing is go'. I've designed it in a way in which it should process the files incrementally and keep nothing huge in memory. However, as soon as the execution reaches the go' part the memory keeps on increasing till it reaches around 90MB at the end, starting from 2MB.

Can someone explain why this memory increase happens and how to avoid it?

import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import System.IO
import System.Environment
import Control.Monad

main = do
  [in_en, in_ar] <- getArgs
  [h_en, h_ar] <- mapM (`openFile` ReadMode) [in_en, in_ar]
  hSetEncoding h_en utf8
  en_txt <- TI.hGetContents h_en
  let len = length $ T.lines en_txt
  len `seq` hClose h_en
  h_en <- openFile in_en ReadMode
  hs@[hO_lm, hO_en, hO_ar] <- mapM (`openFile` WriteMode) ["lm.txt", "tun_"++in_en, "tun_"++in_ar]
  mapM_ (`hSetEncoding` utf8) [h_en, h_ar, hO_lm, hO_en, hO_ar]
  [en_txt, ar_txt] <- mapM TI.hGetContents [h_en, h_ar]
  let txts@[_, _, _] = map T.unlines $ go len en_txt ar_txt
  zipWithM_ TI.hPutStr hs txts
  mapM_ (liftM2 (>>) hFlush hClose) hs
  print "success"
  where
    go len en_txt ar_txt = go' (T.lines en_txt) (T.lines ar_txt)
      where (q,r) = len `quotRem` 3000
            go' [] [] = [[],[],[]]
            go' en ar = let (h:bef, aft)    = splitAt q en 
                            (hA:befA, aftA) = splitAt q ar 
                            ~[lm,en',ar']   = go' aft aftA
                        in [bef ++ lm, h:en', hA:ar']

EDIT

As per @kosmikus's suggestion I've tried replacing zipWithM_ TI.hPutStr hs txts with a loop which prints line by line as shown below. The memory consumption is now 2GB+!

fix (\loop lm en ar -> do
  case (en,ar,lm) of
    ([],_,lm) -> TI.hPutStr hO_lm $ T.unlines lm
    (h:t,~(h':t'),~(lh:lt)) -> do
      TI.hPutStrLn hO_en h
      TI.hPutStrLn hO_ar h'
      TI.hPutStrLn hO_lm lh
      loop lt t t')
  lm en ar

What's going on here?

Was it helpful?

Solution

The function go' builds a [T.Text] with three elements. The list is built lazily: in each step of go each of the three lists becomes known to a certain extent. However, you consume this structure by printing each element to a file in order, using the line:

zipWithM_ TI.hPutStr hs txts

So the way you consume the data does not match the way you produce the data. While printing the first of the three list elements to a file, the other two are built and kept in memory. Hence the space leak.

Update

I think that for the current example, the easiest fix would be to write to the target files during the loop, i.e., in the go' loop. I'd modify go' as follows:

go' :: [T.Text] -> [T.Text] -> IO ()
go' [] [] = return ()
go' en ar = let (h:bef, aft)    = splitAt q en
                (hA:befA, aftA) = splitAt q ar
            in do
              TI.hPutStrLn hO_en h
              TI.hPutStrLn hO_ar hA
              mapM_ (TI.hPutStrLn hO_lm) bef
              go' aft aftA

And then replace the call to go and the subsequent zipWithM_ call with a plain call to:

go hs len en_txt ar_txt
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top