Python urllib2 returns garbage

https://stackoverflow.com/questions/21961958

15-10-2022

Question

I am trying to download a web page with Python and access some elements on the page. The problem is that the downloaded content is garbage. These are the first lines of the page:

‹í}évÛH²æïòSd±ÏmÉ·’¸–%ÕhµÕ%ÙjI¶«JããIÐ(‰îî{æ1æ÷¼Æ¼Í}’ù"à""’‚d÷t»N‰$–\"ãËˆŒˆŒÜøqïíîùï'û¬¼gôÁnžm–úq<ü¹R¹¾¾._›å ìUôv»]¹¡gJÌqÃÍ’‡%z‹[ÎÖ3†[(,jüËÈ½Ú,í~ÌýX;y‰Ùò×f)æ7q…JzÉì¾F<ÞÅ]Uª

This problem happens only on the following website: http://kickass.to. Is it possible that they have somehow protected their page? This is my Python code:

import urllib2
import chardet
url = 'http://kickass.to/'
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
req = urllib2.Request(url, None, headers)
response = urllib2.urlopen(req)
page = response.read()
f = open('page.html','w')
f.write(page)
f.close()
print response.headers['content-type']
print chardet.detect(page)

And the result:

text/html; charset=UTF-8
{'confidence': 0.0, 'encoding': None}

It looks like an encoding issue, but chardet detects 'None'. Any ideas?

Answer

This page is returned with gzip content encoding, so the bytes you read are compressed data rather than HTML text.

(Try printing out response.headers['content-encoding'] to verify this.)
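As a second check that does not depend on the headers, one can look at the first two bytes of the body: a gzip stream always begins with the bytes 0x1f 0x8b. The leading '‹' in the sample from the question is consistent with that, since it is how byte 0x8b is commonly displayed. The helper below is a minimal sketch; the name looks_like_gzip is hypothetical and not part of urllib2.

```python
# Magic number that opens every gzip stream.
GZIP_MAGIC = b'\x1f\x8b'


def looks_like_gzip(body):
    """Return True if the raw response body starts with the gzip magic number.

    `body` is the byte string returned by response.read().
    (Hypothetical helper name, for illustration only.)
    """
    return body[:2] == GZIP_MAGIC
```

With the code from the question, looks_like_gzip(page) would be expected to return True if the page really was served gzip-compressed.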

Most likely the website ignores the 'Accept-Encoding' field in the request and assumes that the client supports gzip (most modern browsers do).

urllib2 does not decompress the response for you, but you can use the gzip module to do it, as described e.g. in this thread: Does python urllib2 automatically uncompress gzip data fetched from webpage?
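A minimal sketch of that approach is shown here, assuming the whole body has already been read into memory. The helper name decompress_body is hypothetical. It relies only on the standard gzip and io modules, so it should behave the same on Python 2 and Python 3.

```python
import gzip
import io


def decompress_body(body, content_encoding):
    """Return the decoded bytes of an HTTP response body.

    `body` is the raw byte string from response.read();
    `content_encoding` is the value of the Content-Encoding header (or None).
    (Hypothetical helper name, for illustration only.)
    """
    encoding = (content_encoding or '').strip().lower()
    if encoding in ('gzip', 'x-gzip'):
        # GzipFile expects a file-like object, so wrap the bytes in BytesIO.
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    # Any other (or missing) encoding: return the body unchanged.
    return body


# Possible use with the code from the question:
# page = decompress_body(response.read(),
#                        response.headers.get('content-encoding'))
```

Once the body is decompressed, chardet should see ordinary HTML bytes rather than compressed data, and the file written to page.html should be readable.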

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow