Question

Following on from my question in the comments on this answer to the question "Gevent pool with nested web requests":

Assuming one has a large number of tasks, is there any downside to using gevent.spawn(...) to spawn all of them simultaneously rather than using a gevent pool and pool.spawn(...) to limit the number of concurrent greenlets?

Formulated differently: is there any advantage to "limiting concurrency" with a gevent.Pool even if not required by the problem to be solved?

Any idea what would constitute a "large number" for this issue?


Solution

It's just cleaner and a good practice when dealing with a lot of stuff. I ran into this a few weeks ago when I was using gevent.spawn() to verify a bunch of emails (on the order of 30k) against DNS :).

from gevent.pool import Pool
import logging

rows = [...]  # a large list of stuff
CONCURRENCY = 200  # run 200 greenlets at once, or whatever you want
pool = Pool(CONCURRENCY)
count = 0

def do_work_function(param1, param2):
    print(param1 + param2)

for row in rows:
    count += 1  # for logging purposes, to track progress
    logging.info(count)
    pool.spawn(do_work_function, *row)  # each row is a (param1, param2) pair; blocks here when pool size == CONCURRENCY

pool.join()  # blocks here until the last 200 are complete

I found in my testing that when CONCURRENCY was around 200, my machine load would hover around 1 on an EC2 m1.small. I did it a little naively, though; if I were to do it again I'd run multiple pools and sleep for some time between them to distribute the load on the NIC and CPU more evenly.
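The "multiple pools with a sleep in between" idea above can be sketched roughly as follows. This is a hypothetical illustration, not the answerer's actual code: the batch size, pause length, and placeholder workload are all made-up numbers.

```python
import gevent
from gevent.pool import Pool

done = []

def do_work(item):
    gevent.sleep(0.001)  # stand-in for real network I/O (e.g. a DNS lookup)
    done.append(item)

rows = list(range(1000))  # placeholder workload
BATCH_SIZE = 200          # hypothetical batch size
PAUSE = 0.05              # hypothetical pause between batches to smooth NIC/CPU load

for start in range(0, len(rows), BATCH_SIZE):
    pool = Pool(BATCH_SIZE)
    for item in rows[start:start + BATCH_SIZE]:
        pool.spawn(do_work, item)
    pool.join()          # wait for this batch to drain completely
    gevent.sleep(PAUSE)  # breather before starting the next batch
```

The trade-off versus a single long-lived pool is that the pauses leave the NIC idle between batches, so total throughput drops slightly in exchange for a smoother load profile.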

One last thing to keep in mind is your open-file limit: keep an eye on it and increase it if need be (http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files). The greenlets I was running were taking up around 5 file descriptors each, so you can run out pretty quickly if you aren't careful. This may not be helpful if your system load is above 1, as you'll start seeing diminishing returns regardless.
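On Unix systems you can check the open-file limit from inside the process with the standard-library `resource` module; a small sketch (the "5 descriptors per greenlet" figure comes from the answer above and will vary for your workload):

```python
import resource

# Per-process open-file limits: the soft limit is the one you actually hit first
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)

# At ~5 file descriptors per greenlet, a rough ceiling on in-flight greenlets:
print("rough greenlet ceiling:", soft // 5)
```

The soft limit can be raised up to the hard limit without root via `resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))`; raising the hard limit itself requires the `ulimit`/sysctl changes described in the linked article.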

OTHER TIPS

Came here from Google and decided to run a few quick tests, spawning an increasing number N of greenlets. Sharing the results, as they might be useful to fellow searchers:

# 1 greenlet
real    0m1.032s
user    0m0.017s
sys     0m0.009s

# 100 greenlets
real    0m1.037s
user    0m0.021s
sys     0m0.010s

# 1,000 greenlets
real    0m1.045s
user    0m0.035s
sys     0m0.013s

# 10,000 greenlets
real    0m1.232s
user    0m0.265s
sys     0m0.059s

# 100,000 greenlets
real    0m3.992s
user    0m3.201s
sys     0m0.444s

So up to 1,000 greenlets the performance loss is tiny, but once you start hitting 10,000+ greenlets, everything slows down.

Test code:

import gevent

N = 1000  # number of greenlets to spawn; vary this to reproduce the timings above

def test():
    gevent.sleep(1)

for _ in range(N):
    gevent.spawn(test)

gevent.wait()  # block until every spawned greenlet has finished
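For contrast with the unbounded spawning above, here is a sketch showing that a Pool actually caps how many greenlets run at once. The pool size of 100 and the task count of 1,000 are arbitrary numbers chosen for illustration; since greenlets are cooperatively scheduled, the plain counter below is safe without locks.

```python
import gevent
from gevent.pool import Pool

active = 0  # greenlets currently running
peak = 0    # highest concurrency observed

def test():
    global active, peak
    active += 1
    peak = max(peak, active)
    gevent.sleep(0.01)  # simulate a short I/O wait
    active -= 1

pool = Pool(100)            # cap concurrency at 100
for _ in range(1000):
    pool.spawn(test)        # blocks whenever 100 greenlets are already in flight
pool.join()

print("peak concurrency:", peak)  # never exceeds the pool size of 100
```

This is the practical answer to the original question: with bare `gevent.spawn()` all 1,000 greenlets would be live at once, while the Pool bounds memory, file descriptors, and scheduler overhead to the 100 you asked for.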
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow