I am currently working on a spider, but I need to be able to call the spider() function more than once to follow links. Here is my code:

import httplib, sys, re

def spider(target, link):
    try:
        conn = httplib.HTTPConnection(target)
        conn.request("GET", "/")
        r2 = conn.getresponse()
        data = r2.read().split('\n')
        for x in data[:]:
            if link in x:
                a=''.join(re.findall("href=([^ >]+)",x))
                a=a.translate(None, '''"'"''')
                if a:
                    return a
    except:
        exit(0)

print spider("www.yahoo.com", "http://www.yahoo.com")

but I only get one link in the output. How can I make it return all the links?

Also, how can I get the subsites from the links so the spider can follow them?


Solution

This is probably closer to what you're looking for:

import httplib, re

def spider(link, depth=0):
    # Stop recursing once we are a couple of levels deep
    if depth > 2:
        return []

    try:
        # httplib wants a bare host name, not a full URL
        host = link.replace("http://", "").split("/")[0]
        conn = httplib.HTTPConnection(host)
        conn.request("GET", "/")
        r2 = conn.getresponse()
        data = r2.read().split('\n')

        links = []
        for x in data:
            if link in x:
                a = ''.join(re.findall("href=([^ >]+)", x))
                a = a.translate(None, '"' + "'")
                if a:
                    links.append(a)

        # Recurse for each link found on this page; iterate over a copy
        # so we aren't looping over the same list we are extending
        for found in list(links):
            links += spider(found, depth + 1)

        return links

    except Exception:
        # A page that fails to fetch simply contributes no links
        return []

print spider("http://www.yahoo.com")

It's untested, but the basics are there. Scrape all the links, then recursively crawl them. The function returns a list of the links found on the page on each call, and when a page is crawled recursively, the links returned by the recursive call are added to that list. The code also has a maximum recursion depth so it doesn't crawl forever.

It still omits some obvious safeguards, like cycle detection.
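As a rough sketch (not part of the original answer), one way to add cycle detection is to thread a shared visited set through the recursion. This standalone version assumes Python 2 and uses urllib2 for the fetch just to keep it short; the names visited and found are mine:

import re, urllib2

def spider(link, depth=0, visited=None):
    # One set shared across the whole recursion tracks URLs already crawled
    if visited is None:
        visited = set()
    if depth > 2 or link in visited:
        return []
    visited.add(link)

    try:
        data = urllib2.urlopen(link).read()
    except Exception:
        return []          # unreachable pages contribute no links

    # Strip surrounding quotes from each href we find
    links = [a.strip('"\'') for a in re.findall("href=([^ >]+)", data)]
    for found in list(links):
        links += spider(found, depth + 1, visited)
    return links

Because the visited check happens before the fetch, each URL is requested at most once, no matter how many pages link to it.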

A few side notes: there are better ways to do some of this.

For example, urllib2 can fetch web pages for you a lot more easily than httplib.
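For instance, a minimal sketch of the fetch step with urllib2 (Python 2); it accepts a full URL directly:

import urllib2

# urlopen takes the full URL and handles the connection details itself
response = urllib2.urlopen("http://www.yahoo.com")
data = response.read().split('\n')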

And BeautifulSoup extracts links from web pages better than your regex + translate kluge.
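For example, here is a sketch assuming BeautifulSoup 4 (the bs4 package, whose link extraction uses find_all; the older BeautifulSoup 3 spells it findAll):

from bs4 import BeautifulSoup

html = '<a href="http://www.yahoo.com/news">News</a> <a name="anchor-only">x</a>'
soup = BeautifulSoup(html, "html.parser")
# href=True skips anchor tags that have no href attribute at all
links = [a["href"] for a in soup.find_all("a", href=True)]
print links          # ['http://www.yahoo.com/news']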

Other tips

Following doorknob's hint, if you just change the return a to yield a, your function becomes a generator. Instead of calling it and getting back a result, you call it and get back an iterator—something you can loop over.

So, change your if block to this:

if link in x:
    a=''.join(re.findall("href=([^ >]+)",x))
    a=a.translate(None, '''"'"''')
    if a:
        yield a

Then change your print statement to this:

for a in spider("www.yahoo.com", "http://www.yahoo.com"):
    print a

And you're done.

However, I'm guessing you didn't really want to join up the findall results; you wanted to loop over each found link separately. How do you fix that? Easy: just loop over the re.findall results and yield once per iteration:

if link in x:
    for a in re.findall("href=([^ >]+)", x):
        a = a.translate(None, '''"'"''')
        if a:
            yield a

For a more detailed explanation of how generators and iterators work, see this presentation.
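If it helps, here is a tiny self-contained illustration of the difference between returning and yielding, separate from the spider code:

def count_up_to(n):
    # Because of `yield`, calling this returns a generator; the body
    # only runs as the caller iterates over it
    i = 1
    while i <= n:
        yield i
        i += 1

for value in count_up_to(3):
    print value        # prints 1, 2, 3 one at a time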
