Question

I'm working on a project that uses a distributed crawler to find and download hosts that serve web content. We have a few million hosts at this point, but we're realizing it's not the least expensive thing in the world: crawling takes time, computing power, etc. So instead of doing all of this ourselves, we're looking into whether we can leverage an outside service to get URLs.

My question is: are there services out there that provide massive lists of web hosts, or just massive lists of constantly updated URLs (which we could then parse to get hosts)? Stuff I've already looked into:

1) Search engine APIs - understandably, none of these will just let you download their entire index.

2) DMOZ and the Alexa top 1 million - these don't have anywhere near enough sites for what we're looking to do, though they're a good start for seed lists.

Anyone have any leads? How would you solve the problem?


Solution

Maybe Common Crawl helps: http://commoncrawl.org/. Common Crawl is a huge, openly available corpus of crawled web pages, and each crawl comes with an index of every URL it captured, which sounds like exactly the kind of list you're after.
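As a rough illustration, here's a minimal sketch (Python, using requests) of pulling captured URLs for a domain pattern out of one Common Crawl snapshot via its public CDX index API at index.commoncrawl.org, then reducing them to hosts. The crawl label "CC-MAIN-2024-10" and the limit value are just examples I'm assuming for the sketch; check the list of currently available indexes on the site. For bulk extraction you'd download the published index files for a whole crawl rather than hit the API, but this shows the shape of the data.

```python
# Sketch: query one Common Crawl CDX index for a URL pattern and collect hosts.
# The crawl label below is an example; substitute a current one from
# https://index.commoncrawl.org/
import json
from urllib.parse import urlparse

import requests

CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"  # example crawl ID


def hosts_for_pattern(url_pattern, limit=1000):
    """Return the distinct hosts among captures matching url_pattern."""
    resp = requests.get(
        CDX_ENDPOINT,
        params={"url": url_pattern, "output": "json", "limit": limit},
        timeout=60,
    )
    resp.raise_for_status()
    hosts = set()
    # The API returns one JSON object per line, each describing a captured URL.
    for line in resp.text.splitlines():
        record = json.loads(line)
        hosts.add(urlparse(record["url"]).hostname)
    return hosts


if __name__ == "__main__":
    # Example: sample hosts captured under *.example.org in this snapshot.
    print(sorted(h for h in hosts_for_pattern("*.example.org") if h)[:20])
```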

Licensed under: CC-BY-SA with attribution