Question

I am trying to download an HTML page and then run a regex on it using Racket. This has worked for some pages but not for others. I eventually worked out that this is because some pages are gzipped: issuing an HTTP GET request with get-pure-port returns the gzipped page, which of course looks like gibberish.

My question: is there a way to unzip the page in Racket so that I can run a regex on it?

Thank you.

Solution

Although well-behaved web servers won't give you gzipped responses unless you send them an Accept-Encoding: gzip request header, not every web server is well-behaved.

So, you need to look for the Content-Encoding: gzip response header and, when it is present, decode the body with gunzip-through-ports. (You can handle Content-Encoding: deflate the same way, using inflate.)
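As a rough sketch of that dispatch, the helper below picks a decoder from the Content-Encoding value. The names decode-body, gzip-bytes and decode-bytes are hypothetical, not part of any library, and the example compresses its own sample data with file/gzip so it can be tried without a network connection. One assumption to be aware of: inflate reads a raw DEFLATE stream, so a server that sends a zlib-wrapped "deflate" body would need extra handling that is not shown here.

```racket
#lang racket

(require file/gunzip
         file/gzip)

;; decode-body is a hypothetical helper, not a library function.
;; encoding: the Content-Encoding value as a string, or #f if absent.
(define (decode-body encoding in out)
  (match (and encoding (string-downcase (string-trim encoding)))
    ["gzip"    (gunzip-through-ports in out)]
    ;; Assumption: the body is a raw DEFLATE stream, which is what
    ;; inflate reads; a zlib-wrapped body is not handled here.
    ["deflate" (inflate in out)]
    [_         (copy-port in out)]))

;; Compress a byte string locally so the decoder can be tried offline.
(define (gzip-bytes bstr)
  (define out (open-output-bytes))
  (gzip-through-ports (open-input-bytes bstr) out #f 0)
  (get-output-bytes out))

;; Run decode-body over an in-memory byte string.
(define (decode-bytes encoding bstr)
  (define out (open-output-bytes))
  (decode-body encoding (open-input-bytes bstr) out)
  (get-output-bytes out))

(define sample #"<html><title>Hello</title></html>")

;; Round trip: this should give back the original bytes.
(decode-bytes "gzip" (gzip-bytes sample))
```

Normalizing the header value with string-downcase and string-trim is a defensive choice for servers that send something like "GZIP"; it is not required for the common case.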

Of course, to "look for the response header" you can't use get-pure-port anymore, because it discards the headers; you have to use get-impure-port and purify-port instead. Pseudo-code:

#lang racket

(require net/url
         net/head
         file/gunzip)

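(define u (string->url "http://www.wikipedia.org"))
;; The impure port still contains the status line and response headers.
(define in (get-impure-port u '("Accept-Encoding: gzip")))
;; purify-port reads the headers off the port and returns them as a string.
(define h (purify-port in))
(define out (open-output-bytes))
;; Decompress only if the server says the body is gzipped.
(match (extract-field "Content-Encoding" h)
  ["gzip" (gunzip-through-ports in out)]
  [_      (copy-port in out)])
;; bstr holds the decoded body as a byte string.
(define bstr (get-output-bytes out))
(close-input-port in)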

p.s. I think the above is easier to explore when trying it out for the first time. But for production code I'd probably use call/input-url to help handle closing the port:

#lang racket

(require net/url
         net/head
         file/gunzip)

(define u (string->url "http://www.wikipedia.org"))
(define bstr
  (call/input-url u
                  (curryr get-impure-port '("Accept-Encoding: gzip"))
                  (lambda (in)
                    (define h (purify-port in))
                    (define out (open-output-bytes))
                    (match (extract-field "Content-Encoding" h)
                      ["gzip" (gunzip-through-ports in out)]
                      [_      (copy-port in out)])
                    (get-output-bytes out))))

p.p.s.

That version might be even clearer if it didn't use curryr and an anonymous function. For example:

#lang racket

(require net/url
         net/head
         file/gunzip)

;; Like get-impure-port, but supplies an Accept-Encoding: gzip request
;; header.
(define (get-impure-port/gzip u)
  (get-impure-port u '("Accept-Encoding: gzip")))

;; Read response headers using purify-port, and read the response
;; entity handling gzip encoding.
(define (read-response in)
  (define h (purify-port in))
  (define out (open-output-bytes))
  (match (extract-field "Content-Encoding" h)
    ["gzip" (gunzip-through-ports in out)]
    [_      (copy-port in out)])
  (get-output-bytes out))

(define bstr
  (call/input-url (string->url "http://www.wikipedia.org")
                  get-impure-port/gzip
                  read-response))
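
Once you have the decoded body, running the regex is the easy part, since Racket regexps work directly on byte strings. A minimal sketch, in which the bstr literal is only a stand-in for the value produced by the code above and the title pattern is just an illustration:

```racket
#lang racket

;; Stand-in for the decoded response body; in practice this would be
;; the bstr returned by read-response.
(define bstr #"<html><head><title>Wikipedia</title></head></html>")

;; Matching against a byte string yields byte-string results.
(define m (regexp-match #rx"<title>(.*?)</title>" bstr))

;; Convert the captured group to a string, assuming the page is UTF-8.
(define title (and m (bytes->string/utf-8 (second m))))
title
```

If you would rather work with strings throughout, you could convert the whole body with bytes->string/utf-8 first, assuming the page really is UTF-8 encoded.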
Licensed under: CC-BY-SA with attribution