Processing (too) many XML files (with TagSoup)

https://stackoverflow.com/questions/5943250

31-10-2019
Question

I have a directory with about 4500 XML (HTML5) files, and I want to create a "manifest" of their data (essentially title and base/@href).

To this end, I've been using a function to collect all the relevant file paths, opening them with readFile, sending them into a TagSoup-based parser, and then outputting/formatting the resulting list.

This works for a subset of the files, but eventually runs into an openFile: resource exhausted (Too many open files) error. After doing some reading, this isn't so surprising: I'm using mapM parseMetaDataFile files, which opens all the handles straight away, and because readFile is lazy, none of them is closed until its contents have been fully consumed.

What I can't figure out is how to work around the problem. I've tried reading a bit about Iteratee. Can I hook that up with TagSoup easily? Strict IO, the way I used it anyway (heh), froze my computer even though the files aren't very big (28 KB on average).
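For illustration, here is a minimal sketch of one way to keep only a single handle open at a time while still using String, relying on nothing outside base. The name processFileStrict is hypothetical and not part of the code in this post; the pure function passed to it stands in for whatever extracts the small piece of data you care about. The idea is to force the (small) result while the handle is still open, so that nothing lazy refers to the file after withFile closes it, and the full contents can be garbage collected before the next file is read.

```haskell
import Control.Exception (evaluate)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

-- | Hypothetical helper (not from the original post): apply a pure function
-- to a file's contents and force every character of the result while the
-- handle is still open. withFile closes the handle on exit, so at most one
-- file is open at any time.
processFileStrict :: (String -> String) -> FilePath -> IO String
processFileStrict f path =
  withFile path ReadMode $ \h -> do
    contents <- hGetContents h          -- still lazy at this point
    let result = f contents
    _ <- evaluate (foldr seq () result) -- forces the spine and each Char
    return result
```

With this in place, something like mapM (processFileStrict extractTitle) paths (extractTitle being a placeholder for your own function) opens and closes the files one after another. For a record such as MetaData, every field would have to be forced in the same way before leaving withFile.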

Any pointers would be greatly appreciated. I realize the approach of creating a big list might fail as well, but 4.5k elements isn't that long... Also, there should probably be less String and more ByteString everywhere.
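On the ByteString point, a rough sketch of what that could look like: Data.ByteString.Char8.readFile reads the whole file strictly and closes the handle before returning, so the handle problem does not arise. The function crudeTitle below is only a stand-in for the TagSoup-based parser (it is an assumption made for illustration and does no real HTML parsing), and titleOfFile is likewise a hypothetical name.

```haskell
import qualified Data.ByteString.Char8 as B

-- | Crude stand-in for the real parser (illustration only): returns the
-- bytes between the first "<title>" and the following "</title>", or an
-- empty ByteString if there is no "<title>".
crudeTitle :: B.ByteString -> B.ByteString
crudeTitle input =
  let openTag   = B.pack "<title>"
      (_, rest) = B.breakSubstring openTag input
      body      = B.drop (B.length openTag) rest
  in  fst (B.breakSubstring (B.pack "</title>") body)

-- | Hypothetical helper: read one file strictly and return only its title.
-- B.copy detaches the small result from the full file buffer, so the
-- buffer can be garbage collected once this action returns.
titleOfFile :: FilePath -> IO B.ByteString
titleOfFile path = do
  contents <- B.readFile path
  return $! B.copy (crudeTitle contents)
```

A 28 KB file held as a strict ByteString takes roughly 28 KB, whereas the same data as a String costs several machine words per character, which may be part of why strict String IO over 4500 files was so heavy.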

Here's some code. I apologize for the naivety:

import System.FilePath
import Text.HTML.TagSoup

data MetaData = MetaData String String deriving (Show, Eq)

-- | Given HTML input, produces a MetaData structure of its essentials.
-- Should obviously account for errors, but simplified here.
readMetaData :: String -> MetaData
readMetaData input = MetaData title base
 where
  title =
    innerText $
    (takeWhile (~/= TagClose "title") . dropWhile (~/= TagOpen "title" []))
    tags
  base = fromAttrib "href" $ head $ dropWhile (~/= TagOpen "base" []) tags
  tags = parseTags input

-- | Parses MetaData from a file.
parseMetaDataFile :: FilePath -> IO MetaData
parseMetaDataFile path = fmap readMetaData $ readFile path

-- | From a given root, gets the FilePaths of the files we are interested in.
-- Not implemented here.
getHtmlFilePaths :: FilePath -> IO [FilePath]
getHtmlFilePaths root = undefined

main :: IO ()
main = do
  -- Will call openFile for every file, which gives too many open files.
  -- The root "." is a placeholder; the original call omitted the argument.
  metas <- mapM parseMetaDataFile =<< getHtmlFilePaths "."

  -- Do stuff with metas, which will cause files to actually be read.
  mapM_ print metas

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow