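Question
I have a Python web client that uses urllib2. It is easy enough to add HTTP headers to my outgoing requests: I just create a dictionary of the headers I want to add and pass it to the Request initializer.
However, other "standard" HTTP headers get added to the request alongside the custom ones I add explicitly. When I sniff the request with Wireshark, I see headers besides the ones I added myself. My question is: how do I get access to these headers? I want to log every request (including the full set of HTTP headers), and I can't figure out how.
Any pointers?
In a nutshell: how do I get all the outgoing headers from an HTTP request created by urllib2?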
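Solution
If you want to see the literal HTTP request that is sent out, and therefore see every last header exactly as it is represented on the wire, you can tell urllib2 to use your own version of an HTTPHandler that prints out (or saves, or whatever) the outgoing HTTP request.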
import httplib, urllib2

class MyHTTPConnection(httplib.HTTPConnection):
    def send(self, s):
        print s  # or save them, or whatever!
        httplib.HTTPConnection.send(self, s)

class MyHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(MyHTTPConnection, req)

opener = urllib2.build_opener(MyHTTPHandler)
response = opener.open('http://www.google.com/')
The result of running this code is:
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.google.com
Connection: close
User-Agent: Python-urllib/2.6
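For readers on Python 3, where httplib became http.client, the same send() override can be sketched as follows. This is a minimal illustration, not part of the original answer: the class name RecordingHTTPConnection, the host www.example.com and the demo-client header value are all made up, and send() here deliberately records the bytes without transmitting them so the sketch runs offline.

```python
import http.client


class RecordingHTTPConnection(http.client.HTTPConnection):
    """Hypothetical helper: records every chunk handed to send()."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.sent_chunks = []

    def send(self, data):
        self.sent_chunks.append(data)
        # Offline demo only: nothing is transmitted. To really send the
        # request, also call super().send(data) here.


# Constructing the connection does not open a socket.
conn = RecordingHTTPConnection("www.example.com")
conn.request("GET", "/", headers={"User-Agent": "demo-client"})

# The request line and all headers arrive in send() as one bytes object.
raw_request = conn.sent_chunks[0].decode("iso-8859-1")
print(raw_request)
```

Because nothing is forwarded, calling getresponse() on this connection would fail; in real use the override would forward the data the way MyHTTPConnection does.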
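Other tips
The urllib2 library uses OpenerDirector objects to handle the actual opening. Fortunately, the Python library provides defaults so you don't have to. It is, however, these OpenerDirector objects that are adding the extra headers.
To see what they are after the request has been sent (so that you can log them, for example):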
req = urllib2.Request(url='http://google.com')
response = urllib2.urlopen(req)
print req.unredirected_hdrs
(produces {'Host': 'google.com', 'User-agent': 'Python-urllib/2.5'} etc)
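unredirected_hdrs is where the OpenerDirectors dump their extra headers. Simply looking at req.headers will show only your own headers - the library leaves those untouched for you.
If you need to see the headers before you send the request, you'll need to subclass the OpenerDirector in order to intercept the transmission.
Hope that helps.
EDIT: I forgot to mention that, once the request has been sent, req.header_items() will give you a list of tuples of ALL the headers, both your own and the ones added by the OpenerDirector. I should have mentioned this first since it's the most straightforward :-) Sorry.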
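The relationship between req.headers, req.unredirected_hdrs and req.header_items() can be illustrated with a small Python 3 sketch (urllib2 became urllib.request there). It is an illustration under assumptions, not the answerer's code: instead of sending anything, it calls the handler's http_request pre-processing step by hand, and the URL and the X-Custom header are invented for the example.

```python
import urllib.request

# Passing a handler instance to build_opener() registers it and sets its
# parent, which the pre-processing step needs for the default headers.
handler = urllib.request.HTTPHandler()
opener = urllib.request.build_opener(handler)

req = urllib.request.Request(
    "http://www.example.com/", headers={"X-Custom": "1"}
)

# Run only the request pre-processing; no connection is opened.
req = handler.http_request(req)

print(req.headers)            # only the headers we supplied ourselves
print(req.unredirected_hdrs)  # headers added by the library
print(req.header_items())     # both sets combined
```

Note that Request stores header names through str.capitalize(), so X-Custom shows up as X-custom. Headers that the connection layer adds later, such as Accept-Encoding and Connection, do not appear in any of these three views.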
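EDIT 2: After your question about an example for defining your own handler, here's the sample I came up with. The concern with any monkeying with the request chain is that we need to be sure the handler is safe for multiple requests, which is why I'm uncomfortable just replacing the definition of putheader on the HTTPConnection class directly.
Sadly, because the internals of HTTPConnection and AbstractHTTPHandler are very internal, we have to reproduce much of the code from the Python library to inject our custom behaviour. Assuming I've not goofed below and this works as well as it did in my 5 minutes of testing, please be careful to revisit this override if you update your Python version to a new revision number (i.e. 2.5.x to 2.5.y, or 2.5 to 2.6, etc.).
I should therefore mention that I am on Python 2.5.1. If you have 2.6 or, particularly, 3.0, you may need to adjust this accordingly.
Please let me know if this doesn't work. I am having way too much fun with this question: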
import urllib2
import httplib
import socket

class CustomHTTPConnection(httplib.HTTPConnection):

    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.stored_headers = []

    def putheader(self, header, value):
        # Remember each header before passing it on to the real implementation
        self.stored_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

class HTTPCaptureHeaderHandler(urllib2.AbstractHTTPHandler):

    def http_open(self, req):
        return self.do_open(CustomHTTPConnection, req)

    http_request = urllib2.AbstractHTTPHandler.do_request_

    def do_open(self, http_class, req):
        # All code here lifted directly from the python library
        host = req.get_host()
        if not host:
            raise urllib2.URLError('no host given')

        h = http_class(host)  # will parse host:port
        h.set_debuglevel(self._debuglevel)

        headers = dict(req.headers)
        headers.update(req.unredirected_hdrs)
        headers["Connection"] = "close"
        headers = dict(
            (name.title(), val) for name, val in headers.items())
        try:
            h.request(req.get_method(), req.get_selector(), req.data, headers)
            r = h.getresponse()
        except socket.error, err:  # XXX what error?
            raise urllib2.URLError(err)
        r.recv = r.read
        fp = socket._fileobject(r, close=True)

        resp = urllib2.addinfourl(fp, r.msg, req.get_full_url())
        resp.code = r.status
        resp.msg = r.reason

        # This is the line we're adding
        req.all_sent_headers = h.stored_headers
        return resp

my_handler = HTTPCaptureHeaderHandler()
opener = urllib2.OpenerDirector()
opener.add_handler(my_handler)
req = urllib2.Request(url='http://www.google.com')
resp = opener.open(req)
print req.all_sent_headers
shows: [('Accept-Encoding', 'identity'), ('Host', 'www.google.com'), ('Connection', 'close'), ('User-Agent', 'Python-urllib/2.5')]
How about something like this:
import urllib2
import httplib

old_putheader = httplib.HTTPConnection.putheader

def putheader(self, header, value):
    print header, value
    old_putheader(self, header, value)

httplib.HTTPConnection.putheader = putheader

urllib2.urlopen('http://www.google.com')
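One way to limit the reach of a class-wide patch like this is to apply it only for the duration of a block and restore the original afterwards. The following Python 3 sketch (http.client instead of httplib) shows that idea; the helper name record_putheader, the host and the header value are hypothetical. It only builds the request headers and never opens a connection.

```python
import contextlib
import http.client


@contextlib.contextmanager
def record_putheader(captured):
    """Hypothetical helper: temporarily patch HTTPConnection.putheader."""
    original = http.client.HTTPConnection.putheader

    def recording_putheader(self, header, *values):
        # Record the header, then delegate to the real implementation.
        captured.append((header, values))
        return original(self, header, *values)

    http.client.HTTPConnection.putheader = recording_putheader
    try:
        yield captured
    finally:
        # Always restore the original method, even if the block raises.
        http.client.HTTPConnection.putheader = original


headers_seen = []
with record_putheader(headers_seen):
    conn = http.client.HTTPConnection("www.example.com")
    conn.putrequest("GET", "/")  # adds its default headers via putheader()
    conn.putheader("User-Agent", "demo-client")

header_names = [name for name, _ in headers_seen]
print(header_names)
```

The patch is still global while the block is active, so other threads making requests at the same time would be recorded too; subclassing the connection avoids that.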
A low-level solution:
import httplib

class HTTPConnection2(httplib.HTTPConnection):
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self._request_headers = []
        self._request_header = None

    def putheader(self, header, value):
        self._request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self._request_header = s
        httplib.HTTPConnection.send(self, s)

    def getresponse(self, *args, **kwargs):
        response = httplib.HTTPConnection.getresponse(self, *args, **kwargs)
        response.request_headers = self._request_headers
        response.request_header = self._request_header
        return response
Example:
conn = HTTPConnection2("www.python.org")
conn.request("GET", "/index.html", headers={
    "User-agent": "test",
    "Referer": "/",
})
response = conn.getresponse()
response.status,response.reason:
1: 200 OK
response.request_headers:
[('Host', 'www.python.org'), ('Accept-Encoding', 'identity'), ('Referer', '/'), ('User-agent', 'test')]
response.request_header:
GET /index.html HTTP/1.1
Host: www.python.org
Accept-Encoding: identity
Referer: /
User-agent: test
Another solution, which uses the trick from "How do you get default headers in a urllib2 Request?" but doesn't copy code from the std-lib:
import httplib
import urllib2

class HTTPConnection2(httplib.HTTPConnection):
    """
    Like httplib.HTTPConnection but stores the request headers.
    Used in HTTPConnection3(), see below.
    """
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.request_headers = []
        self.request_header = ""

    def putheader(self, header, value):
        self.request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self.request_header = s
        httplib.HTTPConnection.send(self, s)


class HTTPConnection3(object):
    """
    Wrapper around HTTPConnection2
    Used in HTTPHandler2(), see below.
    """
    def __call__(self, *args, **kwargs):
        """
        instance made in urllib2.HTTPHandler.do_open()
        """
        self._conn = HTTPConnection2(*args, **kwargs)
        self.request_headers = self._conn.request_headers
        self.request_header = self._conn.request_header
        return self

    def __getattribute__(self, name):
        """
        Redirect attribute access to the local HTTPConnection() instance.
        """
        if name == "_conn":
            return object.__getattribute__(self, name)
        else:
            return getattr(self._conn, name)


class HTTPHandler2(urllib2.HTTPHandler):
    """
    A HTTPHandler which stores the request headers.
    Used HTTPConnection3, see above.

    >>> opener = urllib2.build_opener(HTTPHandler2)
    >>> opener.addheaders = [("User-agent", "Python test")]
    >>> response = opener.open('http://www.python.org/')

    Get the request headers as a list build with HTTPConnection.putheader():
    >>> response.request_headers
    [('Accept-Encoding', 'identity'), ('Host', 'www.python.org'), ('Connection', 'close'), ('User-Agent', 'Python test')]

    >>> response.request_header
    'GET / HTTP/1.1\\r\\nAccept-Encoding: identity\\r\\nHost: www.python.org\\r\\nConnection: close\\r\\nUser-Agent: Python test\\r\\n\\r\\n'
    """
    def http_open(self, req):
        conn_instance = HTTPConnection3()
        response = self.do_open(conn_instance, req)
        response.request_headers = conn_instance.request_headers
        response.request_header = conn_instance.request_header
        return response
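EDIT: Updated the source
See urllib2.py:do_request (line 1044 (1067)) and urllib2.py:do_open (line 1073); (line 293) self.addheaders = [('User-agent', client_version)] (only 'User-agent' is added).
It sounds to me like you're looking for the headers of the response object, which include Connection: close and so on. These headers live in the object returned by urlopen. Getting at them is easy enough: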
from urllib2 import urlopen
req = urlopen("http://www.google.com")
print req.headers.headers
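req.headers is an instance of httplib.HTTPMessage.
It should send the default HTTP headers (as specified by w3.org) alongside the ones you specify. If you would like to see them in their entirety, you can use a tool like WireShark.
Edit:
If you would like to log them, you can use WinPcap to capture the packets sent by a specific application (in your case, python). You can also specify the type of packets and many other details.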
-John