Problem

I'm working on a project where we run a distributed crawler to discover and download hosts that serve web content. We have a few million hosts at this point, but we're realizing this isn't cheap: crawling takes time, computing power, and so on. So instead of doing all of this ourselves, we're looking into whether we can leverage an outside service to get URLs.

My question is: are there services out there that provide massive lists of web hosts and/or massive lists of constantly updated URLs (which we could then parse to extract hosts)? Stuff I've already looked into:

1) Search engine APIs - understandably, none of them will just let you download their entire index.

2) DMOZ and Alexa top 1 million - These don't have nearly enough sites for what we're trying to do, though they're a good start for seed lists.

Anyone have any leads? How would you solve the problem?


Solution

Common Crawl might help: http://commoncrawl.org/ It's a huge, open repository of crawled web data, and it publishes URL indexes you can query or download in bulk.
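
For the original use case (harvesting host names in bulk), below is a minimal sketch of one way to pull URLs from Common Crawl's public CDX index API. The collection ID (`CC-MAIN-2024-10`) and the example domain pattern are assumptions, not part of the original answer; the collection name changes with every crawl, and for truly massive extraction you would download the published index files rather than paginate this API.

```python
import json
from urllib.parse import urlparse

import requests

# Assumption: CC-MAIN-2024-10 is just an example collection ID.
# Pick a current crawl from https://index.commoncrawl.org/.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"


def hosts_for_pattern(url_pattern, limit=500):
    """Query the Common Crawl CDX index and return the distinct hosts seen."""
    resp = requests.get(
        CDX_ENDPOINT,
        params={"url": url_pattern, "output": "json", "limit": limit},
        timeout=60,
    )
    resp.raise_for_status()

    hosts = set()
    # The API returns one JSON object per line; each record has a "url" field.
    for line in resp.text.splitlines():
        record = json.loads(line)
        hosts.add(urlparse(record["url"]).netloc)
    return hosts


if __name__ == "__main__":
    # Example: distinct *.example.com hosts captured in this crawl.
    for host in sorted(hosts_for_pattern("*.example.com")):
        print(host)
```

This only scratches the surface; the same index data is also available as downloadable files per crawl, which is the more practical route when the goal is millions of hosts rather than a single domain pattern.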
