Question
I'm building a large-scale web crawler. What is the optimal number of crawler instances to run when crawling the web from dedicated servers hosted in a server farm?
Solution
A rough rule of thumb, assuming memory is the limiting resource per machine:

instances ≈ (spare_memory_on_machine / memory_footprint_of_crawler_process) × 0.95

The 0.95 factor leaves about 5% of the spare memory as headroom.
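The rule of thumb above can be sketched as a small helper. The function name, the example numbers (32 GiB spare, 512 MiB per process), and the configurable headroom parameter are all illustrative assumptions, not part of the original answer:

```python
def optimal_instances(spare_memory_bytes, crawler_footprint_bytes, headroom=0.95):
    """Estimate how many crawler instances fit in a machine's spare memory.

    The headroom factor (default 0.95) leaves ~5% of the spare memory
    unused so the machine does not start swapping under load spikes.
    """
    return int(spare_memory_bytes / crawler_footprint_bytes * headroom)

# Example: 32 GiB of spare RAM, each crawler process using ~512 MiB.
print(optimal_instances(32 * 1024**3, 512 * 1024**2))  # → 60
```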
OTHER TIPS
To build a large-scale crawler you will have to deal with issues such as:
• The impossibility of keeping all the information in one database.
• Not enough RAM to hold huge indexes.
• Multithreaded performance and concurrency.
• Crawler traps (infinite loops created by changing URLs, calendars, session IDs, ...) and duplicated content.
• Crawling from more than one computer.
• Malformed HTML.
• Constant HTTP errors from servers.
• Databases without compression, which make your space requirements about 8x bigger.
• Recrawl routines and priorities.
• Using requests with compression (deflate/gzip) (good for any kind of crawler).
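The compression tip can be sketched with the standard library alone. The function names and the 30-second timeout are illustrative; the key point is advertising gzip support in the request and decoding the body when the server honours it:

```python
import gzip
import urllib.request

def maybe_decompress(body, content_encoding):
    # Servers that honour "Accept-Encoding: gzip" send a compressed body
    # and set the Content-Encoding response header accordingly.
    return gzip.decompress(body) if content_encoding == "gzip" else body

def fetch_compressed(url, timeout=30):
    # Advertising gzip support typically shrinks HTML transfers severalfold.
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return maybe_decompress(resp.read(), resp.headers.get("Content-Encoding"))
```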
And some important things
• Respect robots.txt
• And a crawl delay on each request, so you don't suffocate web servers.
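Both points above can be combined in one small politeness layer. This is a minimal sketch: the class name, the 1-second default delay, and the in-memory per-host state are assumptions, and `urllib.robotparser` from the standard library does the robots.txt parsing:

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

class PoliteFetcher:
    """Checks robots.txt and enforces a per-host delay between requests."""

    def __init__(self, user_agent="MyCrawler", delay=1.0):
        self.user_agent = user_agent
        self.delay = delay          # minimum seconds between hits to one host
        self._robots = {}           # host -> parsed robots.txt
        self._last_hit = {}         # host -> monotonic time of last request

    def allowed(self, url):
        host = urlsplit(url).netloc
        if host not in self._robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("https://%s/robots.txt" % host)
            rp.read()  # network call: fetches and parses robots.txt once
            self._robots[host] = rp
        return self._robots[host].can_fetch(self.user_agent, url)

    def wait_turn(self, url):
        # Sleep just long enough that consecutive requests to the same
        # host are at least `delay` seconds apart.
        host = urlsplit(url).netloc
        elapsed = time.monotonic() - self._last_hit.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last_hit[host] = time.monotonic()
```

A crawler worker would call `allowed(url)` once per URL and `wait_turn(url)` just before each request.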
The optimal thread configuration will depend on your code. I'm running 100 processes with .NET. I recommend using a scheduler class to avoid opening unnecessary threads.
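One way to sketch such a scheduler (in Python rather than the author's .NET) is a fixed-size thread pool fed from a deduplicated frontier, so concurrency is capped at a configured worker count and no unnecessary threads are ever opened. The `fetch` callback, the batch-synchronous loop, and the `max_pages` cap are all illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(seeds, fetch, num_workers=100, max_pages=1000):
    """Breadth-first crawl: fetch(url) must return a list of discovered links."""
    seen = set(seeds)        # URLs already scheduled (deduplication)
    frontier = list(seeds)   # URLs waiting to be fetched
    crawled = []             # URLs fetched so far, in order
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while frontier and len(crawled) < max_pages:
            batch, frontier = frontier[:num_workers], frontier[num_workers:]
            # pool.map fans the batch out across the worker threads and
            # yields results in batch order.
            for url, links in zip(batch, pool.map(fetch, batch)):
                crawled.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return crawled
```

The fixed pool is the point: the frontier can grow without bound while the number of live threads stays at `num_workers`.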
PS: If you are using only 5 threads, it will take you years to reach "large-scale" web crawling.