Question

Before laying my question bare, some context is needed. I'm trying to issue HTTP GET and POST requests to a website, with the following caveats:

  • Redirects are expected
  • Cookies are required
  • Requests must pass through a SOCKS proxy (v4a)

Up until now, I've been using twisted.web.client.Agent and its subclasses (e.g. BrowserLikeRedirectAgent), but unfortunately it seems as though SOCKS proxies are not supported yet (and ProxyAgent is a no-go because that class is for HTTP proxies).

I stumbled upon twisted-socks, which seems to allow me to do what I want, but I noticed that it uses HTTPClientFactory instead of Agent, hence my question: what is the difference between HTTPClientFactory and Agent, and when should I use each one?

Below is some example code using twisted-socks. I have two additional questions:

  1. How can I use cookies in this example? I tried passing a dict and a cookielib.CookieJar instance to HTTPClientFactory's cookies kwarg, but this raises an error (something about a string being expected... how on earth do I send cookies as a string?)

  2. Can this code be refactored to use Agent? This would be ideal, as I already have a reasonably large codebase that is written with Agent in mind.

```

import sys
from urlparse import urlparse
from twisted.internet import reactor, endpoints
from socksclient import SOCKSv4ClientProtocol, SOCKSWrapper
from twisted.web import client

class mything:
    def __init__(self):
        self.npages = 0
        self.timestamps = {}

    def wrappercb(self, proxy):
        print "connected to proxy", proxy

    def clientcb(self, content):
        print "ok, got: %s" % content[:120]
        print "timestamps " + repr(self.timestamps)
        self.npages -= 1
        if self.npages == 0:
            reactor.stop()

    def sockswrapper(self, proxy, url):
        dest = urlparse(url)
        assert dest.port is not None, 'Must specify port number.'
        endpoint = endpoints.TCP4ClientEndpoint(reactor, dest.hostname, dest.port)
        return SOCKSWrapper(reactor, proxy[1], proxy[2], endpoint, self.timestamps)

def main():
    thing = mything()

    # Mandatory first argument is a URL to fetch over Tor (or whatever
    # SOCKS proxy that is running on localhost:9050).
    url = sys.argv[1]
    proxy = (None, 'localhost', 9050, True, None, None)

    f = client.HTTPClientFactory(url)
    f.deferred.addCallback(thing.clientcb)
    sw = thing.sockswrapper(proxy, url)
    d = sw.connect(f)
    d.addCallback(thing.wrappercb)
    thing.npages += 1

    reactor.run()

if '__main__' == __name__:
    main()

```

Solution

I think you typically wouldn't use HTTPClientFactory directly: it seems to be a low-level class that performs HTTP requests and not much more.

If you just want to fire a request, there are functions (twisted.web.client.getPage and .downloadPage) that construct the factory for you, handling both HTTP and HTTPS.

Agent is a thing that gives you a higher level abstraction: it keeps a connection pool, handles the HTTP/HTTPS choice based on the url, handles proxies etc. And right, this is the thing you usually want to use.

It seems they don't share much code: Agent is to HTTP11ClientProtocol (and HTTP11ClientFactory) as getPage is to the old HTTPClientFactory (and its protocol, HTTPPageGetter). So there's a twisted.web.client vs. ._newclient duality (with Agent as the latter's public API). Historical reasons and backward compatibility, I'd guess.

Anyway, this library won't mix nicely with Agent out of the box, because its API is broken. twisted-socks's SOCKSWrapper declares that it implements the IStreamClientEndpoint interface, but the interface requires the .connect method to return a Deferred that fires with an IProtocol provider (see docs), while SOCKSWrapper returns one that fires with the address (here's the line that does this). It seems you can easily fix it by changing that line to:

```
self.handshakeDone.callback(self.transport.protocol)
```

Once you do that, you should be able to use twisted-socks with Agent. Here's an example: (using inlineCallbacks and the new react, but you could just as well use the standard .addCallback with deferreds and reactor.run())

```
from twisted.internet.endpoints import TCP4ClientEndpoint
from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react
from twisted.web.client import ProxyAgent, readBody

from socksclient import SOCKSWrapper

@react
@inlineCallbacks
def main(reactor):
    # Endpoint for the final destination; SOCKSWrapper reaches it via the proxy.
    target = TCP4ClientEndpoint(reactor, 'example.com', 80)
    proxy = SOCKSWrapper(reactor, 'localhost', 9050, target)
    # ProxyAgent accepts any endpoint and sends every request through it.
    agent = ProxyAgent(proxy)
    response = yield agent.request('GET', 'http://example.com/')
    print (yield readBody(response))
```

There's also the txsocksx library, which seems nicer to use (and is pip-installable!). The API is much the same, but the arguments are swapped: you give the target host and port directly and pass an endpoint for the proxy, whereas twisted-socks takes the proxy host and port and an endpoint for the target:

```
from twisted.internet.endpoints import TCP4ClientEndpoint
from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react
from twisted.web.client import ProxyAgent, readBody

from txsocksx.client import SOCKS5ClientEndpoint

@react
@inlineCallbacks
def main(reactor):
    # Plain TCP endpoint for the SOCKS proxy itself.
    proxy = TCP4ClientEndpoint(reactor, 'localhost', 9050)
    # Wraps the proxy endpoint; the target host and port are given directly.
    proxied_endpoint = SOCKS5ClientEndpoint('example.com', 80, proxy)
    agent = ProxyAgent(proxied_endpoint)
    response = yield agent.request('GET', 'http://example.com/')
    print (yield readBody(response))
```
License: CC-BY-SA with attribution