Question

I am using gevent to download some HTML pages. Some websites are way too slow, and some stop serving requests after a period of time. That is why I had to limit the total time for a group of requests I make. For that I use gevent's "Timeout".

import gevent
from gevent import Timeout

timeout = Timeout(10)
timeout.start()

def downloadSite():
    # code to download the site's urls one by one
    url1 = downloadUrl()
    url2 = downloadUrl()
    url3 = downloadUrl()

try:
    gevent.spawn(downloadSite).join()
except Timeout:
    print 'Lost state here'

But the problem with it is that I lose all the state when the exception fires.

Imagine I crawl the site 'www.test.com'. I have managed to download 10 URLs right before the site admins decided to switch the webserver to maintenance. In such a case I will lose the information about the crawled pages when the exception fires.

The question is: how do I save the state and process the data even if the Timeout fires?


Solution

Why not try something like:

import gevent
from gevent import Timeout

def downloadSite(url):
    # each request gets its own 10-second timeout
    with Timeout(10):
        downloadUrl(url)

urls = ["url1", "url2", "url3"]

workers = []
limit = 5
counter = 0
for i in urls:
    # limit to 5 URL requests at a time
    if counter < limit:
        workers.append(gevent.spawn(downloadSite, i))
        counter += 1
    else:
        gevent.joinall(workers)
        workers = [gevent.spawn(downloadSite, i)]
        counter = 1
gevent.joinall(workers)

You could also record a status for every URL in a dict, or append the ones that fail to a separate list, to retry them later.
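
For instance, here is a rough sketch of that idea (it assumes the same downloadUrl helper as above; the 10-second timeout is just an example value):

import gevent
from gevent import Timeout

results = {}   # url -> downloaded data
failed = []    # urls that timed out, kept for a later retry

def downloadSite(url):
    try:
        # per-request timeout, as above
        with Timeout(10):
            results[url] = downloadUrl(url)
    except Timeout:
        failed.append(url)

workers = [gevent.spawn(downloadSite, url) for url in ["url1", "url2", "url3"]]
gevent.joinall(workers)
# results now holds everything that finished in time; failed holds the rest

Even if some requests time out, the data collected so far stays in results.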

OTHER TIPS

A self-contained example:

import gevent
from gevent import monkey
from gevent import Timeout

monkey.patch_all()
import urllib2

def get_source(url):
    req = urllib2.Request(url)
    data = None
    # pass False so the Timeout is silently suppressed and data stays None
    with Timeout(2, False):
        response = urllib2.urlopen(req)
        data = response.read()
    return data

N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]

print contents[5]

It implements one timeout for each request. In this example, contents contains the HTML source of google.com ten times, each retrieved in an independent request. If one of the requests had timed out, the corresponding element in contents would be None. If you have questions about this code, don't hesitate to ask in the comments.
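
To see which requests timed out, you can pair contents back up with urls, for example:

# contents[i] is None when the request for urls[i] timed out
failed_urls = [url for url, data in zip(urls, contents) if data is None]
ok = [data for data in contents if data is not None]
print '%d of %d requests timed out' % (len(failed_urls), len(urls))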

I saw your last comment. Defining one timeout per request is definitely not wrong from a programming point of view. If you need to throttle traffic to the website, just don't spawn 100 greenlets simultaneously. Spawn 5, and wait until they have returned. Then you can possibly wait for a given amount of time and spawn the next 5 (as already shown in the other answer by Gabriel Samfira, as I see now). For my code above, this would mean that you would have to repeatedly call

N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]

where N should not be too high.
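
In loop form, such a sketch could look like this (the batch size of 5 and the one-second pause between batches are just example values; get_source and urls are reused from above):

batch_size = 5
all_contents = []
for start in xrange(0, len(urls), batch_size):
    batch = urls[start:start + batch_size]
    getlets = [gevent.spawn(get_source, url) for url in batch]
    gevent.joinall(getlets)
    all_contents.extend(g.get() for g in getlets)
    gevent.sleep(1)  # optional pause between batches to throttle traffic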

Licensed under: CC-BY-SA with attribution