I'm writing a small crawler that should fetch a URL multiple times, and I want all of the threads to run at the same time (simultaneously).

I've written a little piece of code that should do that.

import thread
import time
from urllib2 import Request, urlopen, URLError, HTTPError


def getPAGE(FetchAddress):
    attempts = 0
    while attempts < 2:
        req = Request(FetchAddress, None)
        try:
            response = urlopen(req, timeout = 8) #fetching the url
            print "fetched url %s" % FetchAddress
        except HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            try:
                return response.read()
            except:
                print "there was an error with response.read()"
                return None
    return None

url = ("http://www.domain.com",)

for i in range(1,50):
    thread.start_new_thread(getPAGE, url)

From the Apache logs it doesn't seem like the threads are running simultaneously; there's a small gap between requests. It's almost undetectable, but I can see that the threads are not really parallel.

I've read about the GIL. Is there a way to bypass it without calling C/C++ code? I can't really understand how threading is even possible with the GIL; does Python basically just interpret the next thread as soon as it finishes with the previous one?

Thanks.


Solution

As you point out, the GIL often prevents Python threads from running in parallel.

However, that's not always the case. One exception is I/O-bound code: when a thread is waiting for an I/O request to complete, it typically releases the GIL before entering the wait, which means other threads can make progress in the meantime.
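A quick way to observe this (a minimal sketch that runs under Python 2 or 3; `time.sleep` stands in for a blocking network read, since it releases the GIL the same way):

```python
# time.sleep releases the GIL while it waits, just like a blocking
# socket read does, so these four threads overlap their waits:
# total wall time is close to 0.2 s rather than 4 x 0.2 s.
import threading
import time

def wait_a_bit():
    time.sleep(0.2)  # the GIL is released for the duration of the sleep

threads = [threading.Thread(target=wait_a_bit) for _ in range(4)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print("elapsed: %.2f s" % elapsed)  # roughly 0.2, not 0.8
```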

In general, however, multiprocessing is the safer bet when true parallelism is required.

Other tips

I've read about the GIL, is there a way to bypass it without calling C/C++ code?

Not really. Functions called through ctypes release the GIL for the duration of those calls, and functions that perform blocking I/O release it too. There are other similar situations, but they always involve code outside the main Python interpreter loop: you can't let go of the GIL from within your Python code.
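To illustrate the point (a rough experiment; the loop size is arbitrary and timings vary by machine): a CPU-bound pure-Python loop gets no speedup from threads, because only one thread can execute bytecode at a time.

```python
# Two CPU-bound threads take about as long as running the same work
# serially, because the GIL serializes pure-Python bytecode execution.
import threading
import time

N = 2 * 10 ** 6

def count():
    n = 0
    for _ in range(N):
        n += 1

start = time.time()
count()
count()  # serial baseline: two runs back to back
serial = time.time() - start

start = time.time()
workers = [threading.Thread(target=count) for _ in range(2)]
for t in workers:
    t.start()
for t in workers:
    t.join()
threaded = time.time() - start

# With the GIL, the threaded version is no faster than the serial one.
print("serial: %.2f s  threaded: %.2f s" % (serial, threaded))
```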

You can use an approach like this to create all the threads, have them block on a shared event, and then have them start fetching the url "simultaneously":

#!/usr/bin/env python
import threading
import datetime
import urllib2

# An Event (rather than a bare Condition) avoids the race where a
# thread only reaches wait() after the main thread has already
# signalled: waiting on an Event that is already set returns at once.
allgo = threading.Event()

class ThreadClass(threading.Thread):
    def run(self):
        allgo.wait()  # block until the main thread fires the starting gun
        print "%s at %s\n" % (self.getName(), datetime.datetime.now())
        url = urllib2.urlopen("http://www.ibm.com")

for i in range(50):
    t = ThreadClass()
    t.start()

allgo.set()  # release all waiting threads at once

This would get you a bit closer to having all fetches happen at the same time, BUT:

  • The network packets leaving your computer will pass along the ethernet wire in sequence, not at the same time,
  • Even if you have 16+ cores on your machine, some router, bridge, modem or other equipment in between your machine and the web host is likely to have fewer cores, and may serialize your requests,
  • The web server you're fetching stuff from will use an accept() call to respond to your request. For correct behavior, that is implemented using a server-global lock to ensure only one server process/thread responds to your query. Even if some of your requests arrive at the server simultaneously, this will cause some serialisation.

You will probably get your requests to overlap to a greater degree (i.e. others starting before some finish), but you're never going to get all of your requests to start simultaneously on the server.

You can also look at things like the future of PyPy, where we may get software transactional memory (thus doing away with the GIL). This is all just research and intellectual speculation at the moment, but it could grow into something big.

If you run your code with Jython or IronPython (and maybe PyPy in the future), it will run in parallel.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow