Question

I have a crawler that crawls a few different domains for new posts/content. The total amount of content is hundreds of thousands of pages, and a lot of new content is added each day. To get through all of it, I need the crawler to be crawling 24/7.

Currently I host the crawler script on the same server as the site the crawler is adding the content to, and I can only run it via a cronjob during the night, because when it runs, the website basically stops working because of the load from the script. In other words, a pretty crappy solution.

So basically, what is my best option for this kind of setup?

  • Is it possible to keep running the crawler from the same host, but somehow balance the load so that the script doesn't kill the website?

  • What kind of host/server would I be looking for to host a crawler? Are there any specifications I need beyond a normal web host?

  • The crawler saves the images it crawls. If I host my crawler on a secondary server, how do I save the images on my site's server? I guess I don't want chmod 777 on my uploads folder, allowing anyone to put files on my server.


Solution

I decided to go with Amazon Web Services to host my crawler, since it offers both SQS for queues and auto-scalable instances. It also has S3, where I can store all my images.

I also decided to rewrite the whole crawler in Python instead of PHP, to more easily take advantage of things like queues and to keep the app running 100% of the time instead of relying on cronjobs.

So what I did, and what it means

  1. I set up an Elastic Beanstalk application for my crawler, configured as a "Worker" and listening to an SQS queue where I store all the domains that need to be crawled. SQS is a message queue: I push each domain that needs crawling onto it, and the crawler listens to the queue and fetches one domain at a time until the queue is empty. There is no need for cronjobs or anything like that; as soon as data arrives in the queue, it is sent to the crawler. That means the crawler is up 100% of the time, 24/7 (see the sketch after this list).

  2. The application is set to auto scale, meaning that when there are too many domains in the queue, it spins up a second, third, fourth, etc. instance/crawler to speed up the process. I think this is a very important point for anyone who wants to set up a crawler.

  3. All images are saved to an S3 bucket. This means the images are not stored on the crawler's server, and they can easily be fetched and worked with (see the second sketch below).
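
To make point 1 concrete, here is a minimal sketch of how the two sides could look in Python. The queue URL, the Flask setup and the crawl_domain function are illustrative assumptions, not part of my actual setup; what is standard is that in an Elastic Beanstalk "Worker" environment the sqsd daemon reads the queue and POSTs each message body to the application's local HTTP endpoint, so the app itself never polls SQS.

```python
import boto3
from flask import Flask, request

# Producer side: push each domain that needs crawling onto the SQS queue.
# The queue URL is a placeholder.
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/crawler-domains"

def enqueue_domain(domain):
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=domain)

# Consumer side: in a "Worker" environment, the sqsd daemon POSTs each
# queue message to this endpoint. Returning 200 marks the message as
# handled (it gets deleted from the queue); any other status makes sqsd
# retry it later.
application = Flask(__name__)

@application.route("/", methods=["POST"])
def handle_message():
    domain = request.get_data(as_text=True).strip()
    crawl_domain(domain)  # hypothetical: fetch and parse new pages for this domain
    return "", 200

def crawl_domain(domain):
    # Placeholder for the actual crawling logic.
    pass
```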

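For point 3, here is a sketch of saving a crawled image straight to S3 with boto3; the bucket name and key layout are made up for the example. The site's server then never needs a world-writable uploads folder, since it can fetch the images from S3 (or serve them directly from the bucket).

```python
import mimetypes
import boto3

s3 = boto3.client("s3")
BUCKET = "my-crawler-images"  # placeholder bucket name

def save_image(image_bytes, domain, filename):
    # Group objects per domain, e.g. "images/example.com/photo.jpg"
    key = f"images/{domain}/{filename}"
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    s3.put_object(Bucket=BUCKET, Key=key, Body=image_bytes, ContentType=content_type)
    return key
```
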
The results have been great. When I had the PHP crawler running on cronjobs every 15 minutes, I could crawl about 600 URLs per hour. Now I can easily crawl 10,000+ URLs per hour, and even more depending on how I set up the auto scaling.
