Question

I want to get all the URLs under a domain.
When I looked at their robots.txt, it clearly states that some of the folders are off-limits to robots, but I am wondering whether there is a way to get all the URLs that are open to robots. There is no sitemap listed in the robots.txt.

For example, their robots.txt has information that looks similar to this:

User-agent: *
Allow: /
Disallow: /A/
Disallow: /B/
Disallow: /C/
...

But I am interested in all the URLs available to robots that are not included in this blacklist, such as

/contact
/welcome
/product1
/product2
...
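
For what it's worth, I can already check a single known URL against these rules with Python's urllib.robotparser; a minimal sketch, with www.example.com standing in for the real domain:

import urllib.robotparser

# Load and parse the site's robots.txt (www.example.com is a placeholder)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# can_fetch applies the Allow/Disallow rules for the given user agent
print(rp.can_fetch("*", "https://www.example.com/contact"))  # True (not blacklisted)
print(rp.can_fetch("*", "https://www.example.com/A/page"))   # False (under Disallow: /A/)

But that only tells me whether a URL I already know about is open; it does not enumerate them.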

Any idea will be appreciated, and I am also curious whether there is a Yahoo Query Language (YQL) solution for this problem, because this work has probably already been done by Yahoo.

Thanks!


Solution

Yes, there is a way to get all the URLs open to robots.

A simple solution is to go to www.google.com and type site:www.website.com into the search bar.

While that isn't guaranteed to get you every page, it will get you all the pages Google has indexed. And Google adheres to robots.txt, so it seems to fit your purpose.
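
If Google's index is not complete enough, an alternative is to crawl the site yourself while honoring the same robots.txt rules. The sketch below uses only the Python standard library; the start URL and page limit are placeholder assumptions, and a production crawler would also want rate limiting and politeness delays:

import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=100):
    """Breadth-first crawl that skips any URL disallowed by robots.txt."""
    domain = urlparse(start_url).netloc
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()

    seen = set()
    queue = deque([start_url])
    open_to_robots = []

    while queue and len(open_to_robots) < max_pages:
        url = queue.popleft()
        if url in seen or not rp.can_fetch("*", url):
            continue  # already visited, or blocked by a Disallow rule
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable or malformed pages
        open_to_robots.append(url)

        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href.split("#")[0])
            if urlparse(absolute).netloc == domain and absolute not in seen:
                queue.append(absolute)
    return open_to_robots

if __name__ == "__main__":
    for page in crawl("https://www.example.com/"):  # placeholder start URL
        print(page)

The point is that rp.can_fetch is consulted before every request, so the crawler skips exactly the Disallow entries (/A/, /B/, /C/, ...) and collects only URLs that are open to robots.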

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow