Python 3.3 urllib reads HTML in an unknown character set

https://stackoverflow.com/questions/21373434

03-10-2022

Question

I use the following Python 3.3 code to read NYU's home page. However, the output is garbled, as if it were in an unknown character set. The response's Content-Type header says the charset is UTF-8. The same code reads other HTML pages correctly, but not the NYU page. Could you help me understand why?

import urllib.request

url = 'http://www.stern.nyu.edu/'
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0'),
                     ('Content-Type', 'text/html; charset=UTF-8')]
r = opener.open(url)
r.read().decode('UTF-8')

and the result snippet is here:

 ¢=¿/JçW®<× 4ô9ïÛ9$*Á¹³÷î·ïõ¡(ÂÄÀPZÓ¯seßVÿ_<ÅsÎF"t¢ÂQýMâ°AÈX¨ÕA ¨IØ ³ <ðGÀp«�¾X(ÛìÊß}XkfÌ=] Ð0.|¿v°f©ÛTüAH

Solution

The response body is gzip-compressed, so decoding the raw bytes as UTF-8 cannot work. You can either decompress it yourself:

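import gzip

# r must be a fresh response whose body has not been read yet
with gzip.GzipFile(fileobj=r) as handle:
    # handle.read() returns decompressed bytes; decode them to get text
    html = handle.read().decode('UTF-8')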

Or use something like Requests, which does it for you:

import requests

html = requests.get('http://www.stern.nyu.edu/', headers={
    'User-agent': 'Mozilla/5.0'
}).text
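
If you would rather stay with the standard library and cope with both compressed and uncompressed responses, a minimal sketch might look like the following. The helper names decode_body and fetch_html are hypothetical (they are not part of urllib), and the sketch assumes the server signals compression through the Content-Encoding response header and only ever uses gzip.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding, charset="utf-8"):
    """Decompress raw bytes if they are gzip-encoded, then decode to str.

    Hypothetical helper: content_encoding is the value of the
    Content-Encoding response header, or None if the header is absent.
    """
    if (content_encoding or "").strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch_html(url):
    """Fetch url and return its body as text (hypothetical helper)."""
    request = urllib.request.Request(url, headers={
        "User-Agent": "Mozilla/5.0",
        "Accept-Encoding": "gzip",
    })
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        encoding = response.headers.get("Content-Encoding")
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
    return decode_body(raw, encoding, charset)
```

Calling fetch_html('http://www.stern.nyu.edu/') should then return decoded text whether or not the server compresses the response, although other encodings such as deflate would need their own branch.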

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow