Question

So I am writing a packet-sniffing app. Basically I wanted it to sniff for TCP sessions, parse them to see if they are HTTP, and, if they are and have the right content type, save them as a file on my hard drive.

To that end, I wanted it to be efficient. Since the current HTTP library is String-based, I will be dealing with large files, and I only really need to parse HTTP responses, I decided to roll my own parser in attoparsec.

When I finished the program and profiled it parsing a 9 MB HTTP response containing a WAV file, I found it was allocating a gigabyte of memory while parsing out the body of the response. In HTTP.prof I see these lines:

httpBody        Main                      362     1   0.0   0.0   93.8  99.3
 take           Data.Attoparsec.Internal  366  1201   0.0   0.0   93.8  99.3
  takeWith      Data.Attoparsec.Internal  367  3603   0.0   0.0   93.8  99.3
   demandInput  Data.Attoparsec.Internal  375   293   0.0   0.0   93.8  99.2
    prompt      Data.Attoparsec.Internal  378   293   0.0   0.0   93.8  99.2
     +++        Data.Attoparsec.Internal  380   586  93.8  99.2   93.8  99.2

As you can see, somewhere within httpBody, take is called 1201 times, causing 500+ (+++) concatenations of bytestrings, which leads to an absurd amount of memory allocation.

Here's the code. n is just the content length of the HTTP response, if there is one; if there isn't, it just tries to take everything.

I wanted it to return a lazy bytestring made of chunks of 1000 or so bytes, but even if I change it to just take n and return a strict bytestring, it still makes those allocations (and uses 14 GB of memory).


-- n is the Content-Length of the response if it had one, otherwise 0.
httpBody n = do
  x <- if n > 0
    then AC.take n                  -- take exactly n bytes of body
    else AC.takeWhile (\_ -> True)  -- no Content-Length: take whatever is left
  if B.length x == 0
    then return Nothing
    else return (Just x)
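
To make the behaviour concrete, here is a small standalone sketch (not part of the sniffer; demo and chunks are made-up names, and it assumes the current Data.Attoparsec.ByteString module names). It drives AC.take over input delivered in pieces, the way a socket hands it over; in the attoparsec version profiled above, each feed of a Partial result appends the new chunk onto the buffered input, which is where the +++ calls come from:

import qualified Data.Attoparsec.ByteString as A
import qualified Data.Attoparsec.ByteString.Char8 as AC
import qualified Data.ByteString.Char8 as B

-- Made-up demo, not part of the sniffer: drive AC.take over chunked input.
demo :: A.Result B.ByteString
demo = go (A.parse (AC.take (9 * 1024 * 1024)) B.empty) chunks
  where
    -- 9 MB of input delivered in 8 KB pieces, the way a socket hands it over.
    chunks = replicate (9 * 128) (B.replicate 8192 'x')
    go r []     = A.feed r B.empty   -- signal end of input
    go r (c:cs) = go (A.feed r c) cs -- each feed appends the chunk to the buffered input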

I was reading a blog by the guy who did combinatorrent and he was having the same issue, but I never heard of a resolution. Has anyone ever run across this problem before or found a solution?

Edit: Okay, I left this up the entire day and got nothing. After researching the problem, I don't think there is a way to do it without adding a lazy-bytestring accessor to attoparsec. I also looked at all the other parsing libraries, and they either lacked bytestring support or were missing other things I needed.

So I found a workaround. If you think about an HTTP response, it goes: headers, blank line, body. Since the body is last, and lazy parsing returns both what was parsed and what remains of the bytestring, I can skip parsing the body inside attoparsec and instead pluck the body straight off the bytestring that is left over.


-- Parse as many HTTP responses as possible out of the lazy bytestring.
parseHTTPs bs = if P.length results == 0
  then Nothing
  else Just results
  where results = foldParse (bs, [])

-- Fold over the bytestring, collecting Response structures.
foldParse (bs, rs) = case ACL.parse httpResponse bs of
  ACL.Done rest r -> addBody (rest, rs) r
  _               -> rs

-- Slice the body straight off the leftover input, then keep folding.
addBody (rest, rs) http = foldParse (rest', rs')
  where
    contentlength = (read . BU.toString) (maybe "0" id (hdrContentLength (rspHeaders http)))
    rest' = BL.drop contentlength rest
    rs' = rs ++ [http { rspBody = body' }]
    body'
      | contentlength == 0  = Just rest
      | BL.length rest == 0 = Nothing
      | otherwise           = Just (BL.take contentlength rest)

-- Header-only parser; the body is filled in by addBody above.
httpResponse = do
  (code, desc) <- statusLine
  hdrs <- many header
  endOfLine
--  body <- httpBody ((read . BU.toString) (maybe "0" id (hdrContentLength parsedHeaders)))
  return Response { rspCode = code, rspReason = desc, rspHeaders = parseHeaders hdrs, rspBody = undefined }

It is a little messy, but ultimately it works fast and allocates nothing more than I wanted. Basically, I fold over the bytestring collecting HTTP data structures; between steps I check the content length of the structure I just parsed, pull that many bytes off the remaining bytestring as the body, and then continue if there is any bytestring left.
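
For reference, a minimal driver for the above might look like this. It is only a sketch: "stream.bin" is a made-up file name standing in for a reassembled TCP stream, and BL is assumed to be Data.ByteString.Lazy as in the code above.

import qualified Data.ByteString.Lazy as BL

-- Hypothetical driver: read a reassembled TCP stream from a file and
-- print the body length of every HTTP response found in it.
main :: IO ()
main = do
  bs <- BL.readFile "stream.bin"
  case parseHTTPs bs of
    Nothing    -> putStrLn "no HTTP responses in this stream"
    Just resps -> mapM_ (print . fmap BL.length . rspBody) resps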

Edit: I actually finished up this project, and it works like a charm. It isn't cabalized properly, but if anyone wants to view the entire source, you can find it at https://github.com/onmach/Audio-Sniffer.

Solution

combinatorrent guy here :)

If memory serves, the problem with attoparsec is that it demands input a little bit at a time, building up a lazy bytestring which is finally concatenated. My "solution" was to roll the input function myself. That is, I get the input stream for attoparsec from a network socket, and I know how many bytes to expect in a message. Basically, I split into two cases (sketched after the list below):

  • The message is small: Read up to 4k from the socket and eat that ByteString a little bit at a time (slices of bytestrings are fast, and we throw away the 4k chunk after it has been exhausted).

  • The message is "large" (large here means around 16 Kilobyte in bittorrent speak): We calculate how much the 4k chunk we have can fulfill, and then we simply request the underlying network socket to fill things in. We now have two bytestrings, the remaining part of the 4k chunk and the large chunk. They have all data, so concatenating those and parsing them in is what we do.

    You may be able to optimize the concatenation step away.
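
Here is a rough sketch of that two-case idea. It is not the actual combinatorrent code: fetchPayload and recvExact are made-up names, and it assumes Network.Socket.ByteString.recv.

import qualified Data.ByteString as B
import Network.Socket (Socket)
import Network.Socket.ByteString (recv)

-- Fetch the payload of a message whose length we already know.
-- 'buffered' is whatever is left of the current 4k read; returns the
-- payload together with the new leftover buffer.
fetchPayload :: Socket -> B.ByteString -> Int -> IO (B.ByteString, B.ByteString)
fetchPayload sock buffered len
  | B.length buffered >= len =
      return (B.splitAt len buffered)            -- small case: a slice of the buffer is enough
  | otherwise = do
      rest <- recvExact sock (len - B.length buffered)
      return (B.append buffered rest, B.empty)   -- large case: read the remainder off the socket

-- Loop until exactly n bytes have been read (or the peer closes the connection).
recvExact :: Socket -> Int -> IO B.ByteString
recvExact sock n = go n []
  where
    go 0 acc = return (B.concat (reverse acc))
    go k acc = do
      chunk <- recv sock (min k 4096)
      if B.null chunk
        then return (B.concat (reverse acc))
        else go (k - B.length chunk) (chunk : acc)

The resulting strict bytestring is then handed to the parser in one go, which is the concatenation step noted above that you may be able to optimize away.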

The TL;DR version: I handle it outside attoparsec and handroll the loop to avoid the problem.

The relevant combinatorrent commit is fc131fe24; see https://github.com/jlouis/combinatorrent/commit/fc131fe24207909dd980c674aae6aaba27b966d4 for the details.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow