Question

I have a Python web client that uses urllib2. It is easy enough to add HTTP headers to my outgoing requests. I just create a dictionary of the headers I want to add, and pass it to the Request initializer.

However, other "standard" HTTP headers get added to the request as well as the custom ones I explicitly add. When I sniff the request using Wireshark, I see headers besides the ones I add myself. My question is how do a I get access to these headers? I want to log every request (including the full set of HTTP headers), and can't figure out how.

any pointers?

in a nutshell: How do I get all the outgoing headers from an HTTP request created by urllib2?

Was it helpful?

Solution

If you want to see the literal HTTP request that is sent out, and therefore see every last header exactly as it is represented on the wire, then you can tell urllib2 to use your own version of an HTTPHandler that prints out (or saves, or whatever) the outgoing HTTP request.

import httplib, urllib2

class MyHTTPConnection(httplib.HTTPConnection):
    def send(self, s):
        print s  # or save them, or whatever!
        httplib.HTTPConnection.send(self, s)

class MyHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(MyHTTPConnection, req)

opener = urllib2.build_opener(MyHTTPHandler)
response = opener.open('http://www.google.com/')

The result of running this code is:

GET / HTTP/1.1
Accept-Encoding: identity
Host: www.google.com
Connection: close
User-Agent: Python-urllib/2.6

OTHER TIPS

The urllib2 library uses OpenerDirector objects to handle the actual opening. Fortunately, the python library provides defaults so you don't have to. It is, however, these OpenerDirector objects that are adding the extra headers.

To see what they are after the request has been sent (so that you can log it, for example):

req = urllib2.Request(url='http://google.com')
response = urllib2.urlopen(req)
print req.unredirected_hdrs

(produces {'Host': 'google.com', 'User-agent': 'Python-urllib/2.5'} etc)

The unredirected_hdrs is where the OpenerDirectors dump their extra headers. Simply looking at req.headers will show only your own headers - the library leaves those unmolested for you.

If you need to see the headers before you send the request, you'll need to subclass the OpenerDirector in order to intercept the transmission.

Hope that helps.

EDIT: I forgot to mention that, once the request as been sent, req.header_items() will give you a list of tuples of ALL the headers, with both your own and the ones added by the OpenerDirector. I should have mentioned this first since it's the most straightforward :-) Sorry.

EDIT 2: After your question about an example for defining your own handler, here's the sample I came up with. The concern in any monkeying with the request chain is that we need to be sure that the handler is safe for multiple requests, which is why I'm uncomfortable just replacing the definition of putheader on the HTTPConnection class directly.

Sadly, because the internals of HTTPConnection and the AbstractHTTPHandler are very internal, we have to reproduce much of the code from the python library to inject our custom behaviour. Assuming I've not goofed below and this works as well as it did in my 5 minutes of testing, please be careful to revisit this override if you update your Python version to a revision number (ie: 2.5.x to 2.5.y or 2.5 to 2.6, etc).

I should therefore mention that I am on Python 2.5.1. If you have 2.6 or, particularly, 3.0, you may need to adjust this accordingly.

Please let me know if this doesn't work. I'm having waaaayyyy too much fun with this question:

import urllib2
import httplib
import socket


class CustomHTTPConnection(httplib.HTTPConnection):

    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.stored_headers = []

    def putheader(self, header, value):
        self.stored_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)


class HTTPCaptureHeaderHandler(urllib2.AbstractHTTPHandler):

    def http_open(self, req):
        return self.do_open(CustomHTTPConnection, req)

    http_request = urllib2.AbstractHTTPHandler.do_request_

    def do_open(self, http_class, req):
        # All code here lifted directly from the python library
        host = req.get_host()
        if not host:
            raise URLError('no host given')

        h = http_class(host) # will parse host:port
        h.set_debuglevel(self._debuglevel)

        headers = dict(req.headers)
        headers.update(req.unredirected_hdrs)
        headers["Connection"] = "close"
        headers = dict(
            (name.title(), val) for name, val in headers.items())
        try:
            h.request(req.get_method(), req.get_selector(), req.data, headers)
            r = h.getresponse()
        except socket.error, err: # XXX what error?
            raise urllib2.URLError(err)
        r.recv = r.read
        fp = socket._fileobject(r, close=True)

        resp = urllib2.addinfourl(fp, r.msg, req.get_full_url())
        resp.code = r.status
        resp.msg = r.reason

        # This is the line we're adding
        req.all_sent_headers = h.stored_headers
        return resp

my_handler = HTTPCaptureHeaderHandler()
opener = urllib2.OpenerDirector()
opener.add_handler(my_handler)
req = urllib2.Request(url='http://www.google.com')

resp = opener.open(req)

print req.all_sent_headers

shows: [('Accept-Encoding', 'identity'), ('Host', 'www.google.com'), ('Connection', 'close'), ('User-Agent', 'Python-urllib/2.5')]

How about something like this:

import urllib2
import httplib

old_putheader = httplib.HTTPConnection.putheader
def putheader(self, header, value):
    print header, value
    old_putheader(self, header, value)
httplib.HTTPConnection.putheader = putheader

urllib2.urlopen('http://www.google.com')

A low-level solution:

import httplib

class HTTPConnection2(httplib.HTTPConnection):
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self._request_headers = []
        self._request_header = None

    def putheader(self, header, value):
        self._request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self._request_header = s
        httplib.HTTPConnection.send(self, s)

    def getresponse(self, *args, **kwargs):
        response = httplib.HTTPConnection.getresponse(self, *args, **kwargs)
        response.request_headers = self._request_headers
        response.request_header = self._request_header
        return response

Example:

conn = HTTPConnection2("www.python.org")
conn.request("GET", "/index.html", headers={
    "User-agent": "test",
    "Referer": "/",
})
response = conn.getresponse()

response.status, response.reason:

1: 200 OK

response.request_headers:

[('Host', 'www.python.org'), ('Accept-Encoding', 'identity'), ('Referer', '/'), ('User-agent', 'test')]

response.request_header:

GET /index.html HTTP/1.1
Host: www.python.org
Accept-Encoding: identity
Referer: /
User-agent: test

A other solution, witch used the idea from How do you get default headers in a urllib2 Request? But doesn't copy code from std-lib:

class HTTPConnection2(httplib.HTTPConnection):
    """
    Like httplib.HTTPConnection but stores the request headers.
    Used in HTTPConnection3(), see below.
    """
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.request_headers = []
        self.request_header = ""

    def putheader(self, header, value):
        self.request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self.request_header = s
        httplib.HTTPConnection.send(self, s)


class HTTPConnection3(object):
    """
    Wrapper around HTTPConnection2
    Used in HTTPHandler2(), see below.
    """
    def __call__(self, *args, **kwargs):
        """
        instance made in urllib2.HTTPHandler.do_open()
        """
        self._conn = HTTPConnection2(*args, **kwargs)
        self.request_headers = self._conn.request_headers
        self.request_header = self._conn.request_header
        return self

    def __getattribute__(self, name):
        """
        Redirect attribute access to the local HTTPConnection() instance.
        """
        if name == "_conn":
            return object.__getattribute__(self, name)
        else:
            return getattr(self._conn, name)


class HTTPHandler2(urllib2.HTTPHandler):
    """
    A HTTPHandler which stores the request headers.
    Used HTTPConnection3, see above.

    >>> opener = urllib2.build_opener(HTTPHandler2)
    >>> opener.addheaders = [("User-agent", "Python test")]
    >>> response = opener.open('http://www.python.org/')

    Get the request headers as a list build with HTTPConnection.putheader():
    >>> response.request_headers
    [('Accept-Encoding', 'identity'), ('Host', 'www.python.org'), ('Connection', 'close'), ('User-Agent', 'Python test')]

    >>> response.request_header
    'GET / HTTP/1.1\\r\\nAccept-Encoding: identity\\r\\nHost: www.python.org\\r\\nConnection: close\\r\\nUser-Agent: Python test\\r\\n\\r\\n'
    """
    def http_open(self, req):
        conn_instance = HTTPConnection3()
        response = self.do_open(conn_instance, req)
        response.request_headers = conn_instance.request_headers
        response.request_header = conn_instance.request_header
        return response

EDIT: Update the source

see urllib2.py:do_request (line 1044 (1067)) and urllib2.py:do_open (line 1073) (line 293) self.addheaders = [('User-agent', client_version)] (only 'User-agent' added)

It sounds to me like you're looking for the headers of the response object, which include Connection: close, etc. These headers live in the object returned by urlopen. Getting at them is easy enough:

from urllib2 import urlopen
req = urlopen("http://www.google.com")
print req.headers.headers

req.headers is a instance of httplib.HTTPMessage

It should send the default http headers (as specified by w3.org) alongside the ones you specify. You can use a tool like WireShark if you would like to see them in their entirety.

Edit:

If you would like to log them, you can use WinPcap to capture packets sent by specific applications (in your case, python). You can also specify the type of packets and many other details.

-John

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top