Question

I've been experimenting with several Haskell XML libraries, including hexpat and xml-enumerator. After reading the I/O chapter of Real World Haskell (http://book.realworldhaskell.org/read/io.html), I was under the impression that the following code would process the file lazily, with input that has already been consumed being garbage collected as I go through it.

However, when I run it on a big file, memory usage keeps climbing as it runs.

runghc parse.hs bigfile.xml

What am I doing wrong? Is my assumption wrong? Does the map/filter force it to evaluate everything?

import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.UTF8 as U
import Prelude hiding (readFile)
import Text.XML.Expat.SAX 
import System.Environment (getArgs)

main :: IO ()
main = do
    args <- getArgs
    contents <- BSL.readFile (head args)
    -- putStrLn $ U.toString contents
    let events = parse defaultParseOptions contents 
    mapM_ print $ map getTMSId $ filter isEvent events

isEvent :: SAXEvent String String -> Bool
isEvent (StartElement "event" _) = True
isEvent _ = False

getTMSId :: SAXEvent String String -> Maybe String
getTMSId (StartElement _ as) = lookup "TMSId" as
-- Fallback so the function is total; never reached after filter isEvent.
getTMSId _ = Nothing

My end goal is to parse a huge XML file with a simple SAX-like interface. I don't want to have to be aware of the whole structure just to be notified that I've found an "event".


Solution

I'm the maintainer of hexpat. This is a bug, which I have now fixed in hexpat-0.19.8. Thanks for drawing it to my attention.

The bug is new with ghc-7.2.1. It comes from an interaction I didn't expect between a where clause that binds a triple and unsafePerformIO, which I need in order to make the interaction with the C code appear pure in Haskell.
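On the map/filter part of the question: both functions are lazy, so on their own they should not force the whole event list. The following is only a minimal sketch of that point, not hexpat code. The Event type and the endless events stream are hypothetical stand-ins for hexpat's SAXEvent and a parsed file, chosen so the example needs nothing beyond base.

```haskell
-- Hypothetical stand-in for hexpat's SAXEvent, so the sketch needs only base.
data Event
  = StartElement String [(String, String)]
  | EndElement String
  deriving (Show)

-- Same test as isEvent in the question.
isEvent :: Event -> Bool
isEvent (StartElement "event" _) = True
isEvent _ = False

-- Same lookup as getTMSId in the question, made total.
getTMSId :: Event -> Maybe String
getTMSId (StartElement _ attrs) = lookup "TMSId" attrs
getTMSId _ = Nothing

-- An endless, lazily generated event stream (made up for illustration).
events :: [Event]
events = concatMap mk [1 :: Int ..]
  where
    mk n =
      [ StartElement "event" [("TMSId", "EP" ++ show n)]
      , EndElement "event"
      ]

-- The pipeline from the question, as a pure function.
tmsIds :: [Event] -> [Maybe String]
tmsIds = map getTMSId . filter isEvent

main :: IO ()
main = mapM_ print (take 3 (tmsIds events))
```

The program terminates even though events is infinite, which it could not do if map or filter evaluated the entire list first. With a producer that is properly lazy, the same pipeline consumes one event at a time and lets earlier ones be collected.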

Other answers

This appears to be an issue with hexpat. Even when the program is compiled with optimization and does something as simple as taking the length of the event list, memory use grows linearly.

Looking at hexpat, I think there is excessive caching going on (see the parseG function). I suggest contacting the hexpat maintainer(s) and asking if this is expected behavior. It should have been mentioned in the haddocks either way, but resource consumption seems to get ignored too often in library documentation.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow