Question

I am currently working on a spider, but I need to be able to call the spider() function more than once so it can follow links. Here is my code:

import httplib, sys, re

def spider(target, link):
    try:
        conn = httplib.HTTPConnection(target)
        conn.request("GET", "/")
        r2 = conn.getresponse()
        data = r2.read().split('\n')
        for x in data[:]:
            if link in x:
                a=''.join(re.findall("href=([^ >]+)",x))
                a=a.translate(None, '''"'"''')
                if a:
                    return a
    except:
        exit(0)

print spider("www.yahoo.com", "http://www.yahoo.com")

But I only get one link in the output; how can I make it return all of the links?

Also, how can I get the sub-sites from the links so the spider can follow them?


Solution

This is probably closer to what you're looking for:

import httplib, sys, re
from urlparse import urlparse

def spider(link, depth=0):
    # Cap the recursion depth so the crawl can't run forever
    if depth > 2:
        return []

    try:
        # httplib.HTTPConnection wants a bare host name, not a full URL,
        # so split the link into a host and a path first
        parsed = urlparse(link)
        if parsed.netloc:
            host, path = parsed.netloc, parsed.path or "/"
        else:
            host, path = link, "/"
        conn = httplib.HTTPConnection(host)
        conn.request("GET", path)
        r2 = conn.getresponse()
        data = r2.read().split('\n')
        links = []
        for x in data:
            if link in x:
                a = ''.join(re.findall("href=([^ >]+)", x))
                a = a.translate(None, '"' + "'")
                if a:
                    links.append(a)

        # Recurse for each link; iterate over a copy so we don't walk
        # the same list we are extending
        for found in links[:]:
            links += spider(found, depth + 1)

        return links

    except Exception:
        # A page that fails to fetch shouldn't kill the whole crawl
        return []

print spider("http://www.yahoo.com")

It's untested, but the basics are there: scrape all the links on a page, then recursively crawl them. Each call returns a list of the links found on that page, and when a page is crawled recursively, the links returned by the recursive call are added to that list. There is also a maximum recursion depth so the crawl doesn't go on forever.

It still has some obvious oversights, though, like no cycle detection.
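
One rough way to bolt that on (a sketch using a made-up toy_site graph and crawl() helper, just to show the mechanism): pass a shared set of already-visited pages down the recursion and skip anything that's in it.

# Toy link graph standing in for real pages; "/" and "/news" link to
# each other, so a naive recursive crawl would never terminate.
toy_site = {
    "/":       ["/news", "/sports"],
    "/news":   ["/", "/sports"],
    "/sports": ["/news"],
}

def crawl(page, visited=None):
    if visited is None:
        visited = set()
    if page in visited:        # cycle detection: already crawled this page
        return []
    visited.add(page)
    links = toy_site.get(page, [])
    found = list(links)
    for link in links:
        found += crawl(link, visited)
    return found

print crawl("/")   # finishes even though the pages link to each other

In the real spider the same idea applies: add a visited set parameter, check it before fetching, and add each URL to it as you crawl.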

A few side notes: there are better ways to do some of this stuff.

For example, urllib2 can fetch web pages for you a lot more easily than httplib.
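
For instance, a minimal untested sketch with urllib2 (Python 2), which takes the full URL in one call instead of making you split out the host for httplib:

import urllib2

# urllib2 accepts the full URL, handles redirects, and returns a
# file-like object whose body we can read and split into lines.
html = urllib2.urlopen("http://www.yahoo.com").read()
data = html.split('\n')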

And BeautifulSoup extracts links from web pages better than your regex + translate kluge.
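
For example, assuming BeautifulSoup 4 (the bs4 package) is installed, pulling the hrefs out of a page looks roughly like this:

from bs4 import BeautifulSoup

html = '<a href="http://www.yahoo.com/news">News</a> <a href="/sports">Sports</a>'

# Parse once, then read the href attribute off every <a> tag; no regex
# or translate() tricks needed to strip the quotes.
soup = BeautifulSoup(html, "html.parser")
print [a["href"] for a in soup.find_all("a", href=True)]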

OTHER TIPS

Following doorknob's hint, if you just change the return a to yield a, your function becomes a generator. Instead of calling it and getting back a result, you call it and get back an iterator—something you can loop over.

So, change your if block to this:

if link in x:
    a=''.join(re.findall("href=([^ >]+)",x))
    a=a.translate(None, '''"'"''')
    if a:
        yield a

Then change your print statement to this:

for a in spider("www.yahoo.com", "http://www.yahoo.com"):
    print a

And you're done.

However, I'm guessing you didn't really want to join up the findall results; you wanted to loop over each "found" thing separately. How do you fix that? Easy: just loop over the re.findall results and yield once per match:

if link in x:
    for a in re.findall("href=([^ >]+)",x):
        a=a.translate(None, '''"'"''')
        if a:
            yield a

For a more detailed explanation of how generators and iterators work, see this presentation.
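
As a tiny illustration of the mechanics (a hypothetical first_three() generator, nothing to do with the spider itself):

def first_three():
    # Each yield hands back one value and pauses; execution resumes
    # right here the next time the loop asks for another value.
    yield 1
    yield 2
    yield 3

for n in first_three():
    print n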

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow