Question

Previously, in python 2.6, I had made a lot of use of urllib.urlopen to capture web page content and then later post process the data that I received. Now, those routines, and the new routines I am trying to use for python 3.2 are running into what seems to be a windows only (maybe even windows 7 only problem).

Using the following code with python 3.2.2 (64) on windows 7 ...

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

string = fp.read()
fp.close()
print(string.decode("utf8"))

I get the following message:

Traceback (most recent call last):
  File "TATest.py", line 5, in <module>
    string = fp.read()
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 553, in _read_chunked
    self._safe_read(2)      # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

Using the following code instead ...

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)
for Line in fp:
    print(Line.decode("utf8").rstrip('\n'))
fp.close()

I get a fair amount of the web page's content, but then the rest of the capture is thwarted by ...

Traceback (most recent call last):
  File "TATest.py", line 9, in <module>
    for Line in fp:
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 545, in _read_chunked
    self._safe_read(2)  # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

Trying to read another page yields ...

Traceback (most recent call last):
  File "TATest.py", line 11, in <module>
    print(Line.decode("utf8").rstrip('\n'))
  File "d:\python32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position
21: character maps to <undefined>

I do believe this is a windows issue, but can python be made more robust to deal with what is causing it? When trying similar code (version 2.6 code) on Linux, we do not encounter the problem. Is there a way around this? I have also posted to the gmane.comp.python.devel newsgroup

Was it helpful?

Solution

It looks like the page you're reading was encoded as cp1252.

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

string = fp.read()
fp.close()
print(string.decode("cp1252"))

Should work.

There are many ways to specify a content's charset, but using the HTTP header should suffice for most pages:

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

string = fp.read().decode(fp.info().get_content_charset())
fp.close()
print(string)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top