Question

I'd like a list of the top 100,000 domain names sorted by the number of distinct, public web pages.

The list could look something like this

Domain Name         100,000,000 pages
Domain Name          99,000,000 pages
Domain Name          98,000,000 pages
...

I don't want to know which domains are the most popular. I want to know which domains have the highest number of distinct, publicly accessible web pages.

I wasn't able to find such a list in Google. I assume Quantcast, Google or Alexa would know, but have they published such a list?


Solution

For a given domain, e.g. yahoo.com, you can search Google for site:yahoo.com; the top of the results page reports something like "About 141,000,000 results (0.41 seconds)". The count includes subdomains such as www.yahoo.com and it.yahoo.com.
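If you collect such counts for several domains by hand, the number has to be pulled out of the result line before it can be sorted. The following is a minimal sketch, assuming the line has the "About N results (...)" shape quoted above; the function name parse_result_count is made up for illustration, and the format of the line is whatever the search engine happens to display, not a documented interface.

```shell
#!/bin/sh
# Extract the integer estimate from a result-count line such as
#   "About 141,000,000 results (0.41 seconds)"
# Assumption: the number is the first run of digits and commas that is
# followed by the word "results". Prints nothing if the line does not match.
parse_result_count() {
    printf '%s\n' "$1" \
        | sed -n 's/^[^0-9]*\([0-9][0-9,]*\) results.*/\1/p' \
        | tr -d ','
}
```

Calling parse_result_count with the line copied from the browser prints a plain integer, which can then be fed to sort -n alongside the domain name.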

Note also that some websites generate pages on the fly, so they may in effect have an infinite number of "pages". Such a page is computed when it is requested and discarded as soon as it is sent, and each one can link to the next. Because many websites compose their pages this way, nothing visible from the outside distinguishes a generated page from a stored one, and you cannot discover that the supply is endless short of requesting every page.

Other tips

Keep in mind a few things:

  • Many websites generate pages dynamically, leaving a potentially infinite number of pages.
  • Pages are often behind security barriers.
  • Very few companies are interested in announcing how much information they maintain.
  • Indexes go out of date as they're created.

For specific answers, I would be inclined to mirror the sites of interest with wget and count the pages.

wget -m --wait=9 --limit-rate=10K http://domain.test

Keep it slow, so that the site's operators don't mistake your crawl for a denial-of-service attack.
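Once the mirror has finished, the counting step could be sketched as follows. This assumes that a "page" means a regular file whose name ends in .html or .htm; the function name count_pages is invented for this example. Pages that the server delivers without such a suffix are missed unless the mirror was made with wget's --adjust-extension option, and file names containing newlines would be counted more than once.

```shell
#!/bin/sh
# Count the files that look like HTML pages under a mirrored directory tree.
# Assumption: a page is any regular file named *.html or *.htm.
# Usage: count_pages DIRECTORY
count_pages() {
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) \
        | wc -l \
        | tr -d '[:space:]'   # some wc implementations pad the number with spaces
}
```

Because wget -m stores each site in a directory named after its host, running count_pages domain.test in the directory where the mirror was started gives the figure for that site.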

Most search engines also let you search their index by site, though the counts on the result pages are good for little more than a rough order of magnitude, and there is no way to know how much of a site they have actually indexed.

Further down the search-engine path, you might also be interested in the Seeks and YaCy search engine projects, although at a glance I don't see where they keep their index or how to get access to it.

The only organization I can think of that might (a) have the information easily available and (b) be friendly and transparent enough to want to share it would be the folks at The Internet Archive. Since they've been archiving the web with their Wayback Machine for a long time and are big on transparency, they might be a reasonable starting point.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow