Question

I'm working on a Python 2.7 script that must check for the existence of some data in 20,000 objects in a Fedora Commons repository. Basically this means sending 20,000 HTTP requests to 20,000 different URLs on the repository (which runs on a Tomcat server).

I wrote a script that does the job, but the server's system administrator warned me that it opens too many network connections, which causes some trouble.

So far my script uses urllib2 to make the HTTP requests:

import urllib2

response         = urllib2.urlopen(url)
response_content = response.read()

This code effectively opens one new network connection per request.

I have tried other libraries to make the requests, but could not find any way to reuse the same connection for all of them. Both solutions below still open many network connections, even though the number is much lower (each seems to open one connection per 100 HTTP requests, which still means around 200 connections in my case).

httplib:

import httplib
from urlparse import urlparse

url       = "http://localhost:8080/fedora/objects/test:1234?test="
url_infos = urlparse(url)
conn      = httplib.HTTPConnection(url_infos.hostname + ":" + str(url_infos.port))

for x in range(0, 20000):
    myurl = url + str(x)
    conn.request("GET", myurl)
    r = conn.getresponse()
    response_content = r.read()
    print x, "\t", myurl, "\t", r.status

requests:

url = "http://localhost:8080/fedora/objects/test:1234?test="
s   = requests.Session()

for x in range(0, 20000):       
    myurl = url + str(x)
    r = s.get(myurl)
    response_content = r.content
    print x, "\t", myurl, "\t", r.status_code

Even though the number of connections is much better, ideally I'd like to use one, or very few, connections for all the requests. Is that even possible? Is this limit of 100 requests per connection imposed by the system or by the server? By the way, I also tried pointing the requests at an Apache server, with the same result.

Solution

As Lukasa pointed out, both solutions share some underlying code, and since the results were the same whether I queried Apache or Tomcat, I first thought the problem was in the Python code. In fact it was related to the servers' configuration.

The trick is that both Apache and Tomcat have a setting that limits how many HTTP requests can be served over a single TCP connection, and in both cases the default value is 100.

Tomcat:

maxKeepAliveRequests:

    The maximum number of HTTP requests which can be pipelined until the connection is closed by the server.
    If not specified, this attribute is set to 100.

See http://tomcat.apache.org/tomcat-7.0-doc/config/http.html#Standard_Implementation
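
On Tomcat, for example, this limit is raised on the HTTP Connector in conf/server.xml. A minimal sketch (the port and connectionTimeout shown are the stock defaults, not values from the question); per the documentation above, -1 allows an unlimited number of keep-alive requests per connection:

    <!-- conf/server.xml: lift the 100-requests-per-connection cap -->
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               maxKeepAliveRequests="-1" />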

Apache:

MaxKeepAliveRequests:

    The MaxKeepAliveRequests directive limits the number of requests allowed per connection when KeepAlive is on.
    Default:    MaxKeepAliveRequests 100

See http://httpd.apache.org/docs/2.2/en/mod/core.html#maxkeepaliverequests
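
For Apache the equivalent change goes in the server configuration (e.g. httpd.conf); a minimal sketch, where 0 means an unlimited number of requests per connection according to the documentation above:

    # httpd.conf: keep connections alive, with no per-connection request cap
    KeepAlive On
    MaxKeepAliveRequests 0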

After modifying these values, only a very few connections are indeed created.
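
On the client side, the requests.Session from the question is already enough once the server cooperates. As an optional sketch (not something the fix depends on), the underlying urllib3 pool can also be capped at a single connection, so that all requests are serialized over the same socket:

    import requests
    from requests.adapters import HTTPAdapter

    url = "http://localhost:8080/fedora/objects/test:1234?test="
    s   = requests.Session()

    # One pooled connection: as long as the server keeps the connection
    # alive, every request reuses the same socket.
    s.mount("http://", HTTPAdapter(pool_connections=1, pool_maxsize=1))

    for x in range(0, 20000):
        r = s.get(url + str(x))
        response_content = r.content
        print x, "\t", r.status_code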

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow