Question

I wrote a simple program in which I read a big XML file, do some processing on its contents, and then save the processed data in a new file.

The original main function looks something like this:

-- likely imports (elided in the question): qualified Data.ByteString as B,
-- qualified Data.ByteString.Lazy as BL, Data.Aeson (encode), Text.HTML.TagSoup (parseTags)
main = do
  content <- B.readFile "/home/sibi/github/data/chennai.osm"
  let tags = removeUnwanted $ parseTags content
      hospitals = toHospital $ extractHospitalNode tags
  BL.writeFile "osmHospitals.json" (encode hospitals)

But this code eats up all the available memory and takes a very long time to finish, so I decided to use the conduit library to make the program run in constant memory.

But even after reading the conduit tutorial, I still don't see how to make the above program use the conduit library.

I figured out that I can use conduit's sourceFile to stream the contents of the file. But then how do I apply parseTags (a function from the TagSoup library) and the other plain functions to the streamed content?

Edit: The entire code is here

Solution

There's a huge disconnect between the methodology of parseTags and that of conduit and pipes: parseTags assumes it can access the next chunk of data purely, while pipes/conduit are designed for situations where that's impossible, such as streaming from a file. To mix parsing into pipes/conduit, you need a way to interleave the consumption of a parse with steps that pull in new chunks of data.

(I'll use pipes in what follows because I'm more familiar with it, but the idea transfers to conduit.)

We can see this disconnect in the types, though I'll begin with a slightly restricted version.

parseTags :: Lazy.ByteString -> [Tag Lazy.ByteString]
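
For concreteness, here is what parseTags produces on a tiny input; the GHCi session below uses the String specialization just for readability (the real signature is polymorphic, as we'll see later):

ghci> parseTags "<tag>hello</tag>" :: [Tag String]
[TagOpen "tag" [],TagText "hello",TagClose "tag"]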

We can think of Lazy.ByteString as streaming apparatus all by itself; it is, after all, essentially just

type LazyByteString = [Strict.ByteString]

If we were generating the Lazy.ByteString ourselves, we could therefore rely on the laziness of lists to ensure that we never generate more than parseTags needs in order to proceed. (I'll assume, without looking, that parseTags is written so that it can incrementally parse a streaming structure like this.)

-- (with OverloadedStrings enabled, so the literal is a Strict.ByteString)
sillyGen :: LazyByteString
sillyGen = gen 10 where
  gen 0 = []
  gen n = "<tag> </tag>" : gen (n-1)

Now, the problem here is that the streaming behavior of a list depends crucially on being able to generate the tail of the list purely; in the discussion so far there hasn't been any mention of a monad at all. Unfortunately, that cannot be true of a string streamed from a file: we need to interleave an IO action between the streamed chunks, checking whether we've reached EOF and closing the file as necessary.

This is exactly the realm of pipes and conduit, so let's look at what they do to solve that issue.

-- from pipes-bytestring
fromHandle :: Handle -> Producer' Strict.ByteString IO ()

We can think of fromHandle as being the "monadically-interwoven" equivalent to

Lazy.hGetContents :: Handle -> IO Lazy.ByteString

The types suggest a crucial difference between these two operations: hGetContents can be executed in exactly one IO action, while passing a Handle to pipes-bytestring's fromHandle returns a type which is parameterized over IO but can never simply be freed from it. This is exactly indicative of hGetContents using lazy IO (which can be unpredictable due to its use of unsafeInterleaveIO) while fromHandle uses deterministic streaming.
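
To make the contrast concrete, here is a minimal sketch (assuming the pipes and pipes-bytestring packages) that streams a file to stdout in constant memory; withFile guarantees the handle closes exactly when runEffect finishes:

import Pipes
import qualified Pipes.ByteString as PB
import System.IO (withFile, IOMode(ReadMode))

-- Stream the file to stdout chunk by chunk; memory use is bounded by the
-- chunk size, and the handle is closed deterministically when we return.
main :: IO ()
main = withFile "/home/sibi/github/data/chennai.osm" ReadMode $ \h ->
  runEffect $ PB.fromHandle h >-> PB.stdout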

We can write a type similar to Producer Strict.ByteString IO () as

data IOStreamBS = IOSBS { stepStream :: IO (Strict.ByteString, Either IOStreamBS ()) }

In other words we can think of Producer Strict.ByteString IO () as not much more than an IO action which produces exactly the next chunk of the file and (possibly) a new action to get the next chunk. This is how pipes and conduit provide deterministic streaming.
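
To see that "chunk plus optional continuation" shape in action, here is a sketch of a loop that drains an IOStreamBS as defined above (drain is a name invented for this illustration):

import qualified Data.ByteString as Strict

-- Run one step, emit the chunk, and recurse on the continuation
-- until we hit the Right () end marker.
drain :: IOStreamBS -> IO ()
drain s = do
  (chunk, next) <- stepStream s
  Strict.putStr chunk
  case next of
    Left rest -> drain rest
    Right ()  -> return ()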

But it also means that you cannot escape from the IO in one fell swoop—you have to carry it around.


We might thus want to adjust parseTags, which is already somewhat generalized over its input, to accept Producer Strict.ByteString IO () as a StringLike type:

parseTags :: StringLike str => str -> [Tag str]

Let's assume for the sake of argument that we've written an instance StringLike (Producer Strict.ByteString IO ()). Applying parseTags to our producer would then give us a list of Tag (Producer Strict.ByteString IO ()):

type DetStream = Producer Strict.ByteString IO ()
parseTags :: DetStream -> [Tag DetStream]

For this to happen, parseTags would have had to peek into our Producer and cut it up into chunks without executing anything in the IO monad. By this point it should be clear that such a function is impossible: we couldn't even get the first chunk out of the file without doing something in IO.


To remedy this situation, libraries like pipes-parse and pipes-group have arisen which replace the function signature with something more like

parseTagsGrouped :: Producer Strict.ByteString IO () 
                 -> FreeT (Producer (Tag Strict.ByteString) IO) IO ()

which is scary looking but serves an identical purpose to parseTags, except that it generalizes the list to a structure which allows us to execute arbitrary IO actions between the elements. This kind of transformation, as the type shows, can be done purely, so it lets us assemble our streaming machinery from pure combinators and only incur IO when we execute the whole thing at the end (using runEffect).
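
pipes-group then provides combinators for collapsing that structure; for example, concats flattens the groups back into a single Producer. Here is a sketch of consuming the hypothetical parseTagsGrouped this way (printAllTags is a made-up name for this illustration):

import Pipes
import Pipes.Group (concats)
import qualified Pipes.Prelude as P
import qualified Data.ByteString as Strict

-- Flatten the FreeT of tag-producers back into one Producer of tags and
-- print each tag as it arrives; IO is incurred step by step at runEffect.
printAllTags :: Producer Strict.ByteString IO () -> IO ()
printAllTags p = runEffect $ concats (parseTagsGrouped p) >-> P.print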


So, all said and done, it's probably not going to be possible to use pipes or conduit to stream to parseTags: it simply assumes that certain transformations can be done purely, pushing all the IO to one point in time, while pipes/conduit are mechanisms for spreading IO throughout a computation without too much mental overhead.

If you're stuck using parseTags, however, you can get by with lazy IO as long as you're careful. Try a few variations with hGetContents from Data.ByteString.Lazy. The primary problem will be that the file may be closed before the unsafeInterleaveIO'd operations actually get around to reading it, so you'll need to manage strictness very carefully.
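
For instance, here is a minimal sketch of that lazy-IO route; the length forces the entire parse while the handle is still open, and dropping that strictness risks reading from an already-closed handle:

import qualified Data.ByteString.Lazy as BL
import System.IO (withFile, IOMode(ReadMode))
import Text.HTML.TagSoup (parseTags)

main :: IO ()
main = withFile "/home/sibi/github/data/chennai.osm" ReadMode $ \h -> do
  content <- BL.hGetContents h
  -- force the whole parse before withFile closes the handle
  print (length (parseTags content))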

Essentially, that's the big difference between pipes/conduit and lazy IO. With lazy IO, all of the "read a chunk" operations are invisible and implicitly controlled by Haskell's laziness; this is dynamic, implicit, and tough to observe or predict. With pipes/conduit, all of that motion is made extraordinarily explicit and static, but it's up to you to manage the complexity.

OTHER TIPS

What if you use System.IO to read the file line by line (or in chunks of the XML file) and process it as you go?
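
A minimal sketch of that suggestion, with the real per-line processing left as a stub:

import System.IO

main :: IO ()
main = withFile "/home/sibi/github/data/chennai.osm" ReadMode loop
  where
    -- process one line at a time, so only the current line is in memory
    loop h = do
      eof <- hIsEOF h
      if eof
        then return ()
        else do
          line <- hGetLine h
          putStrLn line  -- stand-in for real processing
          loop h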

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow