Question

How can I find all directories and links under a website? Note that the home page does not link to every other page. For example, with a domain like users.company.com, each user has a link such as users.company.com/john, users.company.com/peter, etc. But I don't know how many users there are or what all the links are, and I want to brute-force-check all of them. If I want to write a Python script to do this job, where can I find information to develop it in Python?

Solution

Since you don't have any links pointing from the homepage to the other pages, the only information you have is the URL of the homepage and the URL pattern used for the user pages.

The only possibility I see, then, is to use a dictionary of names, or all possible strings up to a limited length, and then request every single one of them.
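A minimal sketch of that candidate generation, assuming you supply your own wordlist; note that exhaustive strings grow exponentially (three lowercase letters already give 26**3 = 17,576 candidates):

import itertools
import string

def candidate_names(wordlist, max_len=3):
    # Try known names first, then every lowercase string up to max_len.
    for name in wordlist:
        yield name
    for length in range(1, max_len + 1):
        for chars in itertools.product(string.ascii_lowercase, repeat=length):
            yield ''.join(chars)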

Have a look at http://docs.python.org/2/library/urllib.html to see how you can create HTTP requests and open URLs.

Then write a loop that iterates through all the name/string candidates and requests each URL.
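Here is a minimal sketch of such a loop using Python 3's urllib.request (the docs linked above describe the Python 2 equivalent); the base URL is taken from the question, and candidate_names is the hypothetical generator above:

import urllib.error
import urllib.request

def probe(base_url, names):
    # Request each candidate URL and collect the ones that answer 200.
    found = []
    for name in names:
        url = '%s/%s' % (base_url.rstrip('/'), name)
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    found.append(url)
        except urllib.error.HTTPError:
            pass  # e.g. 404: no such user
        except urllib.error.URLError:
            pass  # network problem; you could retry here
    return found

# hypothetical usage with the pattern from the question:
# print(probe('http://users.company.com', candidate_names(['john', 'peter'])))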

You might consider doing this concurrently, which should be relatively easy to achieve: http://en.wikipedia.org/wiki/Embarrassingly_parallel
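Since every URL check is independent of the others, a thread pool from the standard library is one straightforward way to do it; the exists helper below is a hypothetical per-URL check:

import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def exists(url):
    # Return the URL if it responds successfully, else None.
    try:
        with urllib.request.urlopen(url, timeout=5):
            return url
    except (urllib.error.HTTPError, urllib.error.URLError):
        return None

def probe_concurrently(urls, workers=20):
    # Each request is independent, so they map cleanly onto a pool of threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [u for u in pool.map(exists, urls) if u is not None]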

Other tips

Instead of writing a Python script first, you can use third-party tools for this purpose.

1) First, create a mirror of the target web site, then parse it. You can create the mirror with "wget -mk http://www.targetwebsite.com"

http://www.gnu.org/software/wget/

Then you can parse the mirror's HTML files with Python.
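For example, a sketch using only the standard library; the directory name is whatever wget created for the mirrored host:

import os
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect the href attribute of every <a> tag fed to the parser.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def links_in_mirror(root):
    # Walk the mirror directory and extract links from every HTML file.
    collector = LinkCollector()
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(('.html', '.htm')):
                path = os.path.join(dirpath, filename)
                with open(path, encoding='utf-8', errors='replace') as f:
                    collector.feed(f.read())
    return sorted(set(collector.links))

# print(links_in_mirror('www.targetwebsite.com'))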

2) Google dorking example queries:

site:stackoverflow.com
site:stackoverflow.com intitle:"index of"
site:stackoverflow.com ext:html ext:php

Note that this only finds pages Google has indexed, so paths the site excludes via robots.txt will not show up.

Check out the 'mechanize' Python module. It is very handy to use. These pages helped me a lot and are a great source of information: http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/ and https://scraperwiki.com/views/python_mechanize_cheat_sheet/edit/

Just to give you a taste of what 'mechanize' can do, here is a simple function that lists all the links from a given URL:

import mechanize

def list_links(url):
    # Pretend to be a regular browser; some sites reject unknown clients.
    browser = mechanize.Browser()
    browser.addheaders = [('User-Agent', 'Mozilla/4.0')]

    # Fetch the page and print every link mechanize found on it.
    browser.open(url)
    for link in browser.links():
        print(link)

if __name__ == '__main__':
    list_links('http://www.google.com')
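Adapting that to the brute-force task from the question might look like the sketch below; the host name and user list are placeholders:

import mechanize

def find_users(base_url, names):
    # Open each candidate user URL and keep the ones that load without error.
    browser = mechanize.Browser()
    browser.set_handle_robots(False)  # the target's robots.txt may block bots
    browser.addheaders = [('User-Agent', 'Mozilla/4.0')]
    found = []
    for name in names:
        url = '%s/%s' % (base_url.rstrip('/'), name)
        try:
            browser.open(url)
            found.append(url)
        except Exception:  # mechanize raises on 404s and network errors
            pass
    return found

# print(find_users('http://users.company.com', ['john', 'peter']))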