python urllib2.urlopen - html text is garbled - why?
21-12-2019
Question
The printed HTML comes out as garbled text instead of what I see with "view source" in the browser.
Why is that? How to fix it easily?
Thank you for your help.
Same behavior using mechanize, curl, etc.
import urllib
import urllib2
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
print html
Solution
I got the same garbled text using curl:
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm
The response appears to be gzipped, so piping it through gunzip shows the correct HTML for me:
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm | gunzip
Here's a solution for doing this in Python: Convert gzipped data fetched by urllib2 to HTML
Edited by OP:
The revised answer after reading above is:
import urllib2
import gzip
import StringIO
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()  # raw bytes, still gzip-compressed
# GzipFile needs a file-like object, so wrap the bytes in StringIO
data = StringIO.StringIO(html)
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()  # decompressed HTML
html now holds the HTML (print it to see).
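The snippet above assumes the body is always gzipped, and GzipFile raises an error if it is not. A minimal sketch of a more defensive variant follows, relying only on the fact that gzip streams start with the bytes 0x1f 0x8b. The helper name gunzip_if_needed is hypothetical, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gunzip_if_needed(body):
    """Return body decompressed if it looks gzipped, otherwise unchanged."""
    if body[:2] == GZIP_MAGIC:
        # GzipFile needs a file-like object, so wrap the raw bytes.
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body
```

With the code above it would be used as html = gunzip_if_needed(response.read()).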
OTHER TIPS
Try the Requests library:
import requests
response = requests.get("http://www.ncert.nic.in/ncerts/textbook/textbook.htm")
print response.text
This happens because the site sends its response with gzip content encoding. To my knowledge urllib2 does not decompress such responses automatically, so you end up with compressed HTML for sites that use that encoding. You can confirm this by printing the response headers:
print response.headers
There you will see that "Content-Encoding" is gzip. To get around this using only the standard library, you need to decompress the body with the gzip module. Mechanize returns the same compressed data because it uses the same urllib library. Requests decompresses gzip-encoded responses for you automatically.
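As a rough sketch of the standard-library route, the function below decompresses the body only when the Content-Encoding header declares gzip. The name decode_body is hypothetical, and headers is assumed to be any mapping with a get method, which should cover the header object returned by response.info() as well as a plain dict.

```python
import gzip
import io


def decode_body(body, headers):
    """Decompress body if the headers declare a gzip content encoding."""
    # A missing header yields None, so fall back to an empty string.
    encoding = (headers.get("Content-Encoding") or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        # GzipFile needs a file-like object, so wrap the raw bytes.
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body
```

With urllib2 this would be called as html = decode_body(response.read(), response.info()).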