Question

I am building a system in which all URLs, HTML, text, links, etc. are stored as Unicode. To do that, I fetch the HTML of a web page and decode it to Unicode with the code below. It works for a few links I tried, but others, such as the URL in the code below, raise an error. How can I fix this?

import urllib2
from cookielib import CookieJar
cj = CookieJar()
url = 'http://www.economist.com/news/leaders/21596515-there-are-lessons-many-governments-one-countrys-100-years-decline-parable'
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11 Chrome/32.0.1700.77 Safari/537.36'), ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'), ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'), ('Accept-Encoding','gzip,deflate,sdch'), ('Connection', 'keep-alive')]
resp = opener.open(url, timeout=5)
raw_html = resp.read()
raw_html.decode('utf-8')

gives the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte

Solution

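The returned data is gzip-compressed, because the request advertises 'gzip' in its Accept-Encoding header. That is why decoding it directly as UTF-8 fails.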
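To see why the traceback points at byte 0x8b in position 1, here is a small sketch that needs no network access. Every gzip stream begins with the two magic bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte. The sample HTML string is made up for illustration; `io.BytesIO` is used so that the snippet should run on both Python 2.7 and Python 3.

```python
import gzip
import io

# Hypothetical sample body, compressed the way a server might compress it
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(u'<html>caf\xe9</html>'.encode('utf-8'))
compressed = buf.getvalue()

# Every gzip stream starts with the magic bytes 0x1f 0x8b
is_gzip = compressed[:2] == b'\x1f\x8b'

# Decoding the compressed bytes as UTF-8 fails on the second byte (0x8b)
try:
    compressed.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```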

  1. You can try to decompress it:

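    from gzip import GzipFile
    from StringIO import StringIO

    try:
        raw_html = GzipFile(fileobj=StringIO(raw_html)).read()
    except IOError:
        # Not gzip data: keep the body unchanged
        pass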
    
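  2. Or you can send an Accept-Encoding header that does not list 'gzip'. The value 'identity' asks for an uncompressed body; 'deflate' alone could still return compressed data:

    opener.addheaders = [('Accept-Encoding', 'identity')]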
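The two options can also be combined by decompressing according to what the server actually sent. Below is a minimal sketch, not a definitive implementation: `decode_body` is a hypothetical helper name, and its `content_encoding` and `charset` parameters are assumptions about what the caller passes in. It uses `io.BytesIO` rather than `StringIO` so that it should also run on Python 3.

```python
import gzip
import io
import zlib


def decode_body(raw, content_encoding='', charset='utf-8'):
    """Decompress raw bytes according to content_encoding, then decode to text."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip' or raw[:2] == b'\x1f\x8b':
        # gzip header value, or the gzip magic bytes at the start of the body
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    elif encoding == 'deflate':
        try:
            # Most servers send zlib-wrapped deflate data
            raw = zlib.decompress(raw)
        except zlib.error:
            # Some send a raw deflate stream without the zlib header
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)
    return raw.decode(charset)
```

With the opener from the question, this could be called as `decode_body(resp.read(), resp.info().get('Content-Encoding'))`. The charset is assumed to be UTF-8 here; a page that declares a different charset would need that value passed in instead.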
    
Licensed under: CC-BY-SA with attribution