Question

How can I find all directories and links under a website? Note that the home page does not link to every other page. For example, with a domain like users.company.com, each user has a link such as users.company.com/john, users.company.com/peter, etc. But I don't know how many users there are or what all the links are, and I want to brute-force-check all of them. If I want to write a Python script to do this job, where can I find information to develop it in Python?

Solution

Since you don't have any links pointing from the homepage to the other pages, the only information you have is the URL of the homepage and the URL pattern used for the user pages.

The only possibility I see, then, is to use a dictionary of names, or all possible strings up to a limited length, and then request every single one of them.
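A minimal sketch of that candidate generation, assuming you supply your own wordlist; note that exhaustive strings grow exponentially (three lowercase letters already give 26**3 = 17,576 candidates):

import itertools
import string

def candidate_names(wordlist, max_len=3):
    # Try known names first, then every lowercase string up to max_len.
    for name in wordlist:
        yield name
    for length in range(1, max_len + 1):
        for chars in itertools.product(string.ascii_lowercase, repeat=length):
            yield ''.join(chars)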

Have a look at http://docs.python.org/2/library/urllib.html to see how you can create HTTP requests and open URLs.

Then write a loop that iterates through all the name/string candidates and requests each URL.
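Here is a minimal sketch of such a loop using Python 3's urllib.request (the docs linked above describe the Python 2 equivalent); the base URL is taken from the question, and candidate_names is the hypothetical generator above:

import urllib.error
import urllib.request

def probe(base_url, names):
    # Request each candidate URL and collect the ones that answer 200.
    found = []
    for name in names:
        url = '%s/%s' % (base_url.rstrip('/'), name)
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    found.append(url)
        except urllib.error.HTTPError:
            pass  # e.g. 404: no such user
        except urllib.error.URLError:
            pass  # network problem; you could retry here
    return found

# hypothetical usage with the pattern from the question:
# print(probe('http://users.company.com', candidate_names(['john', 'peter'])))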

You might consider doing this concurrently, which should be relatively easy to achieve: http://en.wikipedia.org/wiki/Embarrassingly_parallel
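Since every URL check is independent of the others, a thread pool from the standard library is one straightforward way to do it; the exists helper below is a hypothetical per-URL check:

import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def exists(url):
    # Return the URL if it responds successfully, else None.
    try:
        with urllib.request.urlopen(url, timeout=5):
            return url
    except (urllib.error.HTTPError, urllib.error.URLError):
        return None

def probe_concurrently(urls, workers=20):
    # Each request is independent, so they map cleanly onto a pool of threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [u for u in pool.map(exists, urls) if u is not None]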

Other tips

Instead of writing a Python script first, you can use third-party tools for this purpose.

1) First, create a mirror of the target web site, then parse it. You can create the mirror with "wget -mk http://www.targetwebsite.com"

http://www.gnu.org/software/wget/

Then you can parse the mirror's HTML files with Python.
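For example, a sketch using only the standard library; the directory name is whatever wget created for the mirrored host:

import os
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect the href attribute of every <a> tag fed to the parser.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def links_in_mirror(root):
    # Walk the mirror directory and extract links from every HTML file.
    collector = LinkCollector()
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(('.html', '.htm')):
                path = os.path.join(dirpath, filename)
                with open(path, encoding='utf-8', errors='replace') as f:
                    collector.feed(f.read())
    return sorted(set(collector.links))

# print(links_in_mirror('www.targetwebsite.com'))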

2) Google dorking example queries:

site:stackoverflow.com
site:stackoverflow.com intitle:"index of"
site:stackoverflow.com ext:html ext:php

Note that this only finds pages Google has indexed, so paths the site excludes via robots.txt will not show up.

Check out the 'mechanize' Python module. It is very handy to use. These pages helped me a lot and are a great source of information: http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/ and https://scraperwiki.com/views/python_mechanize_cheat_sheet/edit/

Just to give you a taste of what 'mechanize' can do, here is a simple function that lists all the links from a given URL:

import mechanize

def list_links(url):
    # Pretend to be a regular browser; some sites reject unknown clients.
    browser = mechanize.Browser()
    browser.addheaders = [('User-Agent', 'Mozilla/4.0')]

    # Fetch the page and print every link mechanize found on it.
    browser.open(url)
    for link in browser.links():
        print(link)

if __name__ == '__main__':
    list_links('http://www.google.com')
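Adapting that to the brute-force task from the question might look like the sketch below; the host name and user list are placeholders:

import mechanize

def find_users(base_url, names):
    # Open each candidate user URL and keep the ones that load without error.
    browser = mechanize.Browser()
    browser.set_handle_robots(False)  # the target's robots.txt may block bots
    browser.addheaders = [('User-Agent', 'Mozilla/4.0')]
    found = []
    for name in names:
        url = '%s/%s' % (base_url.rstrip('/'), name)
        try:
            browser.open(url)
            found.append(url)
        except Exception:  # mechanize raises on 404s and network errors
            pass
    return found

# print(find_users('http://users.company.com', ['john', 'peter']))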