Question

Printing the fetched HTML gives garbled text instead of what I see with "view source" in the browser.

Why is that, and how can I fix it easily?

Thank you for your help.

Same behavior using mechanize, curl, etc.

import urllib
import urllib2



start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
print html

Solution

I got the same garbled text using curl:

curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm

The response appears to be gzipped, so piping it through gunzip shows the correct HTML for me:

curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm | gunzip
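
The same check can be sketched in Python. A gzip stream always begins with the two bytes 0x1f 0x8b, so a body can be tested for that prefix before it is decompressed. The helper names looks_gzipped and gunzip_if_needed are hypothetical, chosen only for this illustration; this is a minimal sketch, not the method used by any particular library.

```python
import gzip
import io

# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(data):
    """Return True if the byte string starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def gunzip_if_needed(data):
    """Decompress gzip data; return anything else unchanged."""
    if not looks_gzipped(data):
        return data
    # GzipFile needs a file-like object, so wrap the bytes in BytesIO.
    return gzip.GzipFile(fileobj=io.BytesIO(data)).read()
```

Passing the bytes returned by response.read() to gunzip_if_needed should leave an already plain body untouched and decompress a gzipped one.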

Here's a solution for doing this in Python: Convert gzipped data fetched by urllib2 to HTML

Edited by OP:

After reading the above, the revised code is:

import urllib2
import gzip
import StringIO

start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()  # raw bytes, still gzip-compressed

# Wrap the bytes in a file-like object so GzipFile can read from them
data = StringIO.StringIO(html)
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()  # decompressed HTML

html now holds the decompressed HTML (print it to see).
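
urllib2 and the StringIO module exist only in Python 2. For readers on Python 3, a rough equivalent might look like the sketch below. It assumes, as the snippet above does, that the body is always gzipped. The function name fetch_gzipped_html and its opener parameter are hypothetical; the parameter is there only so the function can be exercised without network access.

```python
import gzip
from urllib.request import urlopen


def fetch_gzipped_html(url, opener=urlopen):
    """Fetch url and gunzip the body, returning the decompressed bytes.

    Assumes the server always sends a gzip-compressed body, as the
    Python 2 snippet above does.
    """
    response = opener(url)
    compressed = response.read()
    return gzip.decompress(compressed)
```

The result is a bytes object; it still has to be decoded with the page's character encoding before it can be treated as text.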

OTHER TIPS

Try Requests (the Python Requests library).

import requests
response = requests.get("http://www.ncert.nic.in/ncerts/textbook/textbook.htm")
print response.text

This happens because the site sends its pages with gzip encoding. To my knowledge, urllib does not decompress responses automatically, so you end up with compressed HTML for sites that use that encoding. You can confirm this by printing the response headers, like so:

print response.headers

There you will see that "Content-Encoding" is gzip. To get around this with the standard urllib library, you need to decompress the body yourself with the gzip module. Mechanize shows the same behavior because it uses the same urllib library. Requests handles this encoding for you and returns the decoded text.
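
The header check described above can be turned into a small decision step: decompress only when the response says it is gzipped. The sketch below is written for Python 3 and assumes the headers are available as a plain mapping of names to values. decode_body is a hypothetical helper for illustration, not part of urllib or Requests.

```python
import gzip


def decode_body(body, headers):
    """Return body decompressed if Content-Encoding says gzip, else as-is.

    headers is assumed to be a dict-like mapping of header names to values.
    """
    encoding = ""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "content-encoding":
            encoding = value.strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    return body
```

Checking the header first avoids running the gzip module on responses that are not compressed, which would otherwise raise an error.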

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow