Question
I'm building a large-scale web crawler. What is the optimal number of crawler instances to run when crawling the web from dedicated servers hosted in a server farm?
Solution
A rough rule of thumb, assuming memory is the limiting resource per machine:

instances ≈ (spare_memory_on_machine / memory_footprint_of_crawler_process) × 0.95

The 0.95 factor leaves about 5% of the spare memory as headroom.
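The rule of thumb above can be sketched as a small helper. The function name, the example numbers (32 GiB spare, 512 MiB per process), and the configurable headroom parameter are all illustrative assumptions, not part of the original answer:

```python
def optimal_instances(spare_memory_bytes, crawler_footprint_bytes, headroom=0.95):
    """Estimate how many crawler instances fit in a machine's spare memory.

    The headroom factor (default 0.95) leaves ~5% of the spare memory
    unused so the machine does not start swapping under load spikes.
    """
    return int(spare_memory_bytes / crawler_footprint_bytes * headroom)

# Example: 32 GiB of spare RAM, each crawler process using ~512 MiB.
print(optimal_instances(32 * 1024**3, 512 * 1024**2))  # → 60
```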
OTHER TIPS
To build a large-scale crawler you will have to deal with issues such as:
• The impossibility of keeping all the information in one database.
• Not enough RAM to hold huge indexes.
• Multithreaded performance and concurrency.
• Crawler traps (infinite loops created by changing URLs, calendars, session IDs, ...) and duplicated content.
• Crawling from more than one computer.
• Malformed HTML.
• Constant HTTP errors from servers.
• Databases without compression, which make your space requirements about 8x bigger.
• Recrawl routines and priorities.
• Using requests with compression (deflate/gzip) (good for any kind of crawler).
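The compression tip can be sketched with the standard library alone. The function names and the 30-second timeout are illustrative; the key point is advertising gzip support in the request and decoding the body when the server honours it:

```python
import gzip
import urllib.request

def maybe_decompress(body, content_encoding):
    # Servers that honour "Accept-Encoding: gzip" send a compressed body
    # and set the Content-Encoding response header accordingly.
    return gzip.decompress(body) if content_encoding == "gzip" else body

def fetch_compressed(url, timeout=30):
    # Advertising gzip support typically shrinks HTML transfers severalfold.
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return maybe_decompress(resp.read(), resp.headers.get("Content-Encoding"))
```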
And some important things
• Respect robots.txt
• And a crawl delay on each request, so you don't suffocate web servers.
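Both points above can be combined in one small politeness layer. This is a minimal sketch: the class name, the 1-second default delay, and the in-memory per-host state are assumptions, and `urllib.robotparser` from the standard library does the robots.txt parsing:

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

class PoliteFetcher:
    """Checks robots.txt and enforces a per-host delay between requests."""

    def __init__(self, user_agent="MyCrawler", delay=1.0):
        self.user_agent = user_agent
        self.delay = delay          # minimum seconds between hits to one host
        self._robots = {}           # host -> parsed robots.txt
        self._last_hit = {}         # host -> monotonic time of last request

    def allowed(self, url):
        host = urlsplit(url).netloc
        if host not in self._robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("https://%s/robots.txt" % host)
            rp.read()  # network call: fetches and parses robots.txt once
            self._robots[host] = rp
        return self._robots[host].can_fetch(self.user_agent, url)

    def wait_turn(self, url):
        # Sleep just long enough that consecutive requests to the same
        # host are at least `delay` seconds apart.
        host = urlsplit(url).netloc
        elapsed = time.monotonic() - self._last_hit.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last_hit[host] = time.monotonic()
```

A crawler worker would call `allowed(url)` once per URL and `wait_turn(url)` just before each request.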
The optimal thread configuration will depend on your code. I'm running 100 processes with .NET. I recommend using a scheduler class to avoid opening unnecessary threads.
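One way to sketch such a scheduler (in Python rather than the author's .NET) is a fixed-size thread pool fed from a deduplicated frontier, so concurrency is capped at a configured worker count and no unnecessary threads are ever opened. The `fetch` callback, the batch-synchronous loop, and the `max_pages` cap are all illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(seeds, fetch, num_workers=100, max_pages=1000):
    """Breadth-first crawl: fetch(url) must return a list of discovered links."""
    seen = set(seeds)        # URLs already scheduled (deduplication)
    frontier = list(seeds)   # URLs waiting to be fetched
    crawled = []             # URLs fetched so far, in order
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while frontier and len(crawled) < max_pages:
            batch, frontier = frontier[:num_workers], frontier[num_workers:]
            # pool.map fans the batch out across the worker threads and
            # yields results in batch order.
            for url, links in zip(batch, pool.map(fetch, batch)):
                crawled.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return crawled
```

The fixed pool is the point: the frontier can grow without bound while the number of live threads stays at `num_workers`.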
PS: If you are using only 5 threads, it will take you years to reach "large-scale" web crawling.