Question

I intend to fetch a large amount of data over HTTP/HTTPS using http-conduit. In order to do this efficiently, I want to use the Accept-Encoding: deflate,gzip header to allow the server (if supported) to transfer the data in a compressed way.

However, some of the servers I want to fetch from seem to incorrectly respond with the Content-Encoding: gzip header while not returning gzip data.

Therefore I need to handle these cases:

  • Server does not support compression --> Return plain response body
  • Server returns gzipped/deflated content --> Return decompressed response body
  • Server says (in response headers it returns gzipped content, but gzip decoding fails --> Return plain response body

In the third case, it can (in this specific case) safely be assumed, that the plaintext, uncompressed data does not look like gzip data, so Response headers say it is gzipped && un-gzip fails is fully equivalent to The data is not compressed.

How can I do this using http-conduit?

Note: This question intentionally does not show research effort because it has been answered immediately in a Q&A-style way.

Was it helpful?

Solution

In order to make this answer more concise, we will use code & concepts from some of my previous posts:

  • simpleHttpWithManager from here
  • Tolerant gzip/deflate decoding from here

To avoid redundancy, I will first explain the basic steps and then provide a full example.

First, we shall handle sending the headers. Note that wile http-types contains hContentEncoding, hAcceptEncoding is not predefined. Besides that, this is a trivial task.

After sending the request, we need to check if there is a Content-Encoding. If there is none, we shall assume uncompressed plaintext, else we need to check if it is gzip or deflate. Which one it is exactly does not matter in this context as [zlib] supports automatic detection by header.

For this rather simple example we just assume that if the server returns a Content-Encoding value that is neither absent nor gzip nor deflate, the response is not compressed. As we did not allow (by Accept-Encoding) other compressions like sdch, the server would be violating the HTTP standard by acting that way.

If we detect a compressed encoding, we try to decompress and return it. If this fails or if the data is not compressed at all, we return the plain response body.

Here's the example:

{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit
import Network.Connection
import Codec.Compression.Zlib.Internal
import Data.Maybe
import Data.Either
import Network.HTTP.Types
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Lazy.Char8 as LB

myurl :: String
myurl = "http://stackoverflow.com"

hAcceptEncoding :: HeaderName
hAcceptEncoding = "Accept-Encoding"

-- | The Accept-Encoding HTTP header value for allowing gzip or deflated responses
gzipDeflateEncoding :: ByteString
gzipDeflateEncoding = "gzip,deflate"

-- HTTP header list that allows gzipped/deflated response
compressionEnabledHeaders :: RequestHeaders
compressionEnabledHeaders = [(hAcceptEncoding, gzipDeflateEncoding)]

-- | Give an encoding string and a HTTP response object,
--   Checks if the Content-Encoding header value of the response object
--   is equal to the given encoding. Returns false if no ContentEncoding
--   header exists in the given response, or if the value does not match
--   the encoding parameter.
hasResponseEncoding :: ByteString -> Response b -> Bool
hasResponseEncoding encoding response =
    let responseEncoding = lookup hContentEncoding headers
        headers = responseHeaders response
    in maybe False (== encoding) responseEncoding

-- | Convert the custom error format from zlib to a Either
decompressStreamToEither :: DecompressStream -> Either String LB.ByteString
decompressStreamToEither (StreamError _ errmsg) = Left errmsg
decompressStreamToEither stream@(StreamChunk _ _) = Right $ fromDecompressStream stream
decompressStreamToEither StreamEnd = Right $ ""

-- | Decompress with explicit error handling
safeDecompress :: LB.ByteString -> Either String LB.ByteString
safeDecompress bstr = decompressStreamToEither $ decompressWithErrors gzipOrZlibFormat defaultDecompressParams bstr

-- | Decompress gzip, if it fails, return uncompressed String
decompressIfPossible :: LB.ByteString -> LB.ByteString
decompressIfPossible bstr =
    let conv (Left a) = bstr
        conv (Right a) = a
    in (conv . safeDecompress) bstr

-- | Tolerantly decompress response body. As some HTTP servers set the header incorrectly,
--   just return the plain response text if the compression fails
decompressResponseBody :: Response LB.ByteString -> LB.ByteString
decompressResponseBody res
    | hasResponseEncoding "gzip" res = decompressIfPossible $ responseBody res
    | hasResponseEncoding "deflate" res = decompressIfPossible $ responseBody res
    | otherwise = responseBody res

-- | Download like with simpleHttp, but using an existing manager for the task
--   and automatically requesting & handling gzipped data
simpleHttpWithAutoGzip :: Manager -> String -> IO LB.ByteString
simpleHttpWithAutoGzip manager url = do req <- parseUrl url
                                        let req' = req {requestHeaders = compressionEnabledHeaders}
                                        fmap decompressResponseBody $ httpLbs req' manager

-- Example usage
main :: IO ()
main = do manager <- newManager conduitManagerSettings -- Create a simple manager
          content <- simpleHttpWithAutoGzip manager "http://stackoverflow.com"
          -- Print the uncompressed content
          print $ content
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top