Why does this URL raise BadStatusLine with httplib2 and urllib2?
Question
I'm trying to fetch pages from this URL with httplib2 and urllib2, but every attempt fails with the following exception.
content = conn.request(uri="http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1129, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 901, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 871, in _conn_request
response = conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1027, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
raise BadStatusLine(line)
The HTTP headers looked like this:
http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902
GET /news/news_print.asp?artice_id=20110727092902 HTTP/1.1
Host: www.zdnet.co.kr
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: ko-kr,ko;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: RMID=7d83495d4f336fe0; __utma=37206251.1552605885.1328771258.1328771258.1329070845.2; __utmz=37206251.1328771258.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); ASPSESSIONIDCSQCQTDD=BCLEHPPDEPHEBJDLCFNDMKDN; __utmc=37206251; ASPSESSIONIDSSQCQQCB=MJPLMOJAFPDFCLONCANBIKHN; _EXEN=2
X-FireLogger: 1.2
HTTP/1.1 200 OK
Date: Mon, 13 Feb 2012 18:02:56 GMT
Content-Length: 19158
Content-Type: text/html;charset=UTF-8; Charset=UTF-8
Set-Cookie: ASPSESSIONIDSQSDQRDB=NGAIFHKAGDIOGEMANAOLLKKF; path=/
Cache-Control: private
Any clue?
Solution
This works fine for me:
import urllib2
opener = urllib2.build_opener()
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1',
}
opener.addheaders = headers.items()
response = opener.open("http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902")
print response.headers
print response.read()
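The website appears to discard every request that does not carry a browser-like User-Agent string. By default, urllib2 identifies itself as Python-urllib rather than as a browser.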
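As a rough sketch of the same idea, the header can be attached to a single request instead of to the opener, and the BadStatusLine case can be caught explicitly. The names build_request, fetch, and BROWSER_UA are hypothetical helpers introduced here for illustration, and the try/except import is only an assumption to let the snippet load on both Python 2 and Python 3.

```python
try:
    # Python 2, as used in the answer above
    from urllib2 import Request, urlopen
    from httplib import BadStatusLine
except ImportError:
    # Python 3 equivalents
    from urllib.request import Request, urlopen
    from http.client import BadStatusLine

# Hypothetical default; any browser-like string should work the same way.
BROWSER_UA = ('Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) '
              'Gecko/20100101 Firefox/10.0.1')


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch(url, user_agent=BROWSER_UA):
    """Fetch url and return the body, or None if the status line is unreadable."""
    try:
        return urlopen(build_request(url, user_agent)).read()
    except BadStatusLine:
        # The server closed the connection or answered with something
        # that is not a valid HTTP status line.
        return None
```

Calling fetch("http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902") needs network access and should return the page body if the server accepts the header.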
Other tips
For anyone who ends up here with a similar problem after installing httplib2 0.8:
Version 0.8 has a regression in connection handling related to HTTP keep-alive. See the bug report: https://code.google.com/p/httplib2/issues/detail?id=250
A fix for this issue exists, but it has not been released so far. Until then, just use httplib2 0.7.7.
In my code, when I use
from urllib2 import urlopen
content = urlopen(page).read()
the exception appears. However, when I use
import urllib
content = urllib.urlopen(page).read()
everything works. Maybe this will help you.
It looks like this webpage doesn't allow your user agent. You can change it like this:
>>> import urllib2
>>> user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
>>> headers = { 'User-Agent' : user_agent }
>>> r = urllib2.Request('http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902', headers=headers)
>>> fd = urllib2.urlopen(r)
>>> fd.read()[:20]
'<!DOCTYPE html PUBLI'