Question

I am building a web crawler that fetches 1-3 pages from each domain in a list of millions of domains. I am using Python with multithreading, and I have tried httplib, httplib2, urllib, urllib2, urllib3, requests, and curl (the fastest of the bunch), as well as Twisted and Scrapy, but none of them let me use more than about 10 Mbit of bandwidth (I have a 60 Mbit connection). Throughput usually maxes out at around 100-300 threads, and beyond that I start getting failed requests. I have had the same problem with PHP/cURL. I also have a scraper that pulls from Google Plus pages using urllib3 and the threading module, and that one maxes out my 100 Mbit connection (I believe this is because it reuses an open socket to the same host, and Google's network response is fast).

Here is an example of one of my scripts, using pycurl. It reads the URLs from a CSV file.

import pycurl
import csv
import cStringIO
from threading import Thread
from Queue import Queue


def get(readq, writeq):
    # Worker: pull URLs off the shared queue and fetch them with pycurl.
    while True:
        url = readq.get()

        buf = cStringIO.StringIO()  # fresh buffer per request so responses don't pile up
        c = pycurl.Curl()
        c.setopt(pycurl.NOSIGNAL, 1)  # needed for timeouts when using libcurl from threads
        c.setopt(pycurl.TIMEOUT, 15)
        c.setopt(pycurl.FOLLOWLOCATION, 1)
        c.setopt(pycurl.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0')
        c.setopt(pycurl.WRITEFUNCTION, buf.write)
        c.setopt(pycurl.URL, url)
        try:
            c.perform()
            writeq.put(url + '  ' + str(c.getinfo(pycurl.HTTP_CODE)))
        except pycurl.error:
            writeq.put('error  ' + url)
        finally:
            c.close()
            buf.close()
readq = Queue()
writeq = Queue()

reader = csv.reader(open('alldataunq2.csv'))
ct = 0
for l in reader:
    if l[3] != '':                      # column 4 holds the domain
        readq.put('http://' + l[3])
        ct += 1
        if ct > 100000:                 # cap the run at 100k URLs
            break

for i in range(100):
    t = Thread(target=get, args=(readq, writeq))
    t.daemon = True  # workers loop forever; daemon threads let the process exit
    t.start()

while True:
    print(writeq.get())

The bottleneck is definitely network I/O, as my processor and memory are barely being used. Has anyone had success writing a similar scraper that was able to use a full 100 Mbit connection or more?

Any input on how I can increase the speed of my scraping code is greatly appreciated.

No correct solution

OTHER TIPS

There are several factors you need to keep in mind when optimizing crawling speed.

Connection locality

In order to re-use connections effectively, you need to make sure that you're reusing connections for the same website. If you wait too long to hit an earlier host a second time, the connection could time out and that's no good. Opening new sockets is a relatively expensive operation so you want to avoid it at all costs. A naive heuristic to achieve this is to sort your download targets by host and download one host at a time, but then you run into the next problem...
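As a rough sketch of the connection-reuse idea (not code from the answer; the pool sizes and timeout are arbitrary assumptions), urllib3's PoolManager keeps a keep-alive pool per host, so repeated hits on the same site reuse an open socket:

import urllib3

# One keep-alive pool per host; repeat requests to a host reuse its open socket
# instead of paying for a new TCP (and possibly TLS) handshake each time.
http = urllib3.PoolManager(num_pools=200, maxsize=2, timeout=15.0)

def fetch(url):
    return http.request('GET', url, retries=1)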

Spreading the load between hosts

Not all hosts have fat pipes, so you'll want to hit multiple hosts simultaneously—this also helps avoiding spamming a single host too much. A good strategy here is to have multiple workers, where each worker focuses on one host at a time. This way you can control the rate of downloads per host within the context of each worker, and each worker will maintain its own connection pool to reuse connections from.
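A minimal sketch of that worker layout (my own illustration, not from the answer; the worker count, delay, and the example URL list are assumptions): group the URLs by host, and let each worker take one host at a time with its own keep-alive pool and a polite delay between requests:

import time
import urllib3
from Queue import Queue
from threading import Thread
from urlparse import urlparse          # urllib.parse on Python 3
from collections import defaultdict

url_list = ['http://example.com/', 'http://example.org/a']   # stand-in for your own URL source
host_queue = Queue()                   # each item: (host, [paths for that host])

def host_worker():
    while True:
        host, paths = host_queue.get()
        pool = urllib3.HTTPConnectionPool(host, maxsize=1, timeout=15.0)
        for path in paths:
            try:
                r = pool.request('GET', path, retries=1)
                print(host + path + '  ' + str(r.status))
            except Exception:
                print('error  ' + host + path)
            time.sleep(0.5)            # per-host rate limit lives inside the worker
        pool.close()

by_host = defaultdict(list)
for url in url_list:
    p = urlparse(url)
    by_host[p.netloc].append(p.path or '/')
for item in by_host.items():
    host_queue.put(item)

for _ in range(100):                   # a fixed worker count spreads load across hosts
    Thread(target=host_worker).start()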

Worker specialization

One way to ruin your throughput is to mix your data processing routines (parse the HTML, extract links, whatever) with the fetching routines. A good strategy here is to do the minimal amount of processing work in the fetching workers, and simply save the data for a separate set of workers to pick up later and process (maybe on another machine, even).
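For instance (a sketch with made-up worker counts and a toy "parser", just to show the split), the fetch threads only download and enqueue the raw body, and a separate, smaller set of threads does the processing:

import urllib2
from Queue import Queue
from threading import Thread

url_queue = Queue()                    # URLs waiting to be fetched
raw_pages = Queue(maxsize=1000)        # fetched-but-unprocessed bodies

def fetch_worker():
    # Network-bound: download and hand off immediately, no parsing here.
    while True:
        url = url_queue.get()
        try:
            body = urllib2.urlopen(url, timeout=15).read()
            raw_pages.put((url, body))
        except Exception:
            pass

def parse_worker():
    # CPU-bound: all HTML/link extraction happens here (or on another machine
    # reading from a shared store), so it never stalls the fetchers.
    while True:
        url, body = raw_pages.get()
        links = [part for part in body.split('"') if part.startswith('http')]
        # ... persist `links` for later crawling ...

for _ in range(100):
    Thread(target=fetch_worker).start()
for _ in range(4):
    Thread(target=parse_worker).start()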

Keeping these things in mind, you should be able to squeeze more out of your connection. An unrelated suggestion: consider using wget; you'd be surprised how effective it is at simple crawls (it can even read from a giant manifest file).
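For example (the file name and directory are just placeholders), wget can crawl straight from a URL list:

wget --input-file=urls.txt --tries=1 --timeout=15 --directory-prefix=pages/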

I don't think you can expect to get anywhere near your internet connection's max throughput when doing web scraping.

Scraping (and web browsing in general) involves making a lot of small requests. A good deal of that time is spent in connection setup and teardown, and in waiting for the remote end to begin delivering your content. I'd guess the time spent actively downloading content is probably around 50%. If you were downloading a bunch of big files, you'd see better average throughput.

Try scrapy with scrapy-redis.

You will have to tune the settings: CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP. Also make sure you have DOWNLOAD_DELAY = 0 and AUTOTHROTTLE_ENABLED = False.
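A sketch of what that might look like in settings.py (the scheduler and Redis lines are the usual scrapy-redis hookup; the concurrency numbers are arbitrary starting points, not values from the answer):

# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"   # let scrapy-redis schedule requests
REDIS_URL = "redis://localhost:6379"             # assumed local Redis instance
CONCURRENT_REQUESTS = 500
CONCURRENT_REQUESTS_PER_DOMAIN = 2
CONCURRENT_REQUESTS_PER_IP = 0                   # 0 = the per-domain limit applies instead
DOWNLOAD_DELAY = 0
AUTOTHROTTLE_ENABLED = False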

Licensed under: CC-BY-SA with attribution