Question

I need to routinely access and parse XML data from a website of the form:

https://api.website.com/stuff/getCurrentData?security_key=blah

I cannot post the actual URL because the data is sensitive. When I put this URL into my browser (Safari), I get XML back.

When I call this through urllib2, I get junk.

import urllib2

f = urllib2.urlopen("https://api.website.com/stuff/getCurrentData?security_key=blah")
s = f.read()
f.close()
s
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xc5\x96mo\xda0\x10\xc7\xdf\xf7SX\xbc\xda4\x15\xc7y\x00R\xb9\xae\xfa\xb4U\x1a-\x150M{5y\xe1\x06V\x13\x079\x0e\x14>\xfd\x9c\x84\xb0\xd2\xa4S\xa4L\xe5\x95\xef\xeeo 

The post "Urllib's urlopen breaking on some sites (e.g. StackApps api): returns garbage results" seems to describe a similar problem, but it deals with JSON rather than XML. Following its advice to look at the headers, I think I am getting gzip data back. Here is the test suggested there:

req = urllib2.Request("https://api.website.com/stuff/getCurrentData?security_key=blah",
                      headers={'Accept-Encoding': 'gzip, identity'})
conn = urllib2.urlopen(req)
val = conn.read()
conn.close()
val[0:25]
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xc5\x96]o\xda0\x14\x86\xef\xfb+,\xae6M'

In that post, there was some suggestion that this could be a local problem, so I tried an example site.

f = urllib2.urlopen("http://www.python.org")
s = f.read()
f.close()
s
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n\n<head>\n  <meta http-equiv="content-type" content="text/html; charset=utf-8" />\n  <title>Python Programming Language &ndash; Official Website</title>\n  

This works just fine, so I think it has something to do with the site API that I am actually trying to access.

The post "Why does text retrieved from pages sometimes look like gibberish?" suggested that I might need to do something with Selenium, but then the poster said the problem "fixed itself", which does not help me figure out what is wrong.

Am I not able to use Python to download secure data? Do I need to use something other than urllib2 and urlopen?

I am running Python 2.7 on Mac OS X 10.7.5.

Solution

You are retrieving gzip-compressed data; the server says so explicitly with the Content-Encoding: gzip response header. Either use the zlib library to decompress the data:

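import zlib

# 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
# val is the raw response body read in the question
data = decomp.decompress(val)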

or use a library that supports transparent decompression if the response headers indicate compression has been used, like requests.
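To illustrate what "transparent decompression" amounts to, here is a minimal sketch that picks the decoding step from the Content-Encoding value. The helper name decode_body and the locally built sample payload are assumptions made for illustration; they are not part of urllib2 or requests.

```python
import gzip
import io
import zlib


def decode_body(body, content_encoding):
    """Return body decompressed according to a Content-Encoding value.

    Hypothetical helper: only 'gzip' and 'identity' are handled here.
    """
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "gzip":
        # 16 + MAX_WBITS makes zlib expect the gzip header and trailer.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == "identity":
        return body
    raise ValueError("unsupported Content-Encoding: %r" % content_encoding)


# Build a gzip payload locally so the sketch runs without network access.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(b"<data>ok</data>")
sample = buf.getvalue()

xml_bytes = decode_body(sample, "gzip")
```

With the question's code, the call would look something like decode_body(val, conn.info().get('Content-Encoding')), assuming the header is present on the response.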

OTHER TIPS

'\x1f\x8b' is indeed the magic number that starts a gzip stream, so you are getting gzip data back.
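The magic-number check can be sketched as a small function. The name looks_like_gzip is made up for this example, and the prefix below is just the first ten bytes of the output shown in the question.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(data):
    """True if the byte string starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# First bytes of the response shown in the question.
prefix = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03"

print(looks_like_gzip(prefix))            # True
print(looks_like_gzip(b"<?xml version"))  # False
```

The third byte, 0x08, is the compression-method field and denotes deflate, which is what gzip uses in practice.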

In your second example you explicitly accept gzip-encoded data. Change that to 'Accept-Encoding': 'identity' and see if it makes a difference.
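A sketch of that change, using the placeholder URL from the question. The try/except import is only there so the snippet also runs on Python 3, and the network call is left commented out because the real endpoint is not known.

```python
try:
    import urllib2 as request_lib  # Python 2, as in the question
except ImportError:
    import urllib.request as request_lib  # Python 3

# Placeholder URL copied from the question, not a real endpoint.
URL = "https://api.website.com/stuff/getCurrentData?security_key=blah"

# Ask for an uncompressed body instead of 'gzip, identity'.
req = request_lib.Request(URL, headers={"Accept-Encoding": "identity"})

# Request stores header names capitalized, hence "Accept-encoding".
accept = req.get_header("Accept-encoding")

# conn = request_lib.urlopen(req)  # needs the real endpoint
# val = conn.read()
# conn.close()
```

A server is free to ignore this header and send gzip anyway, so checking the response is still worthwhile.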

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow