Question

I am trying to teach myself Python by writing a very simple web crawler with it.

The code for it is here:

#!/usr/bin/python

import sys, getopt, time, urllib, re

LINK_INDEX = 1
links = [sys.argv[len(sys.argv) - 1]]
visited = []
politeness = 10
maxpages = 20

def print_usage():
    print "USAGE:\n./crawl [-politeness <seconds>] [-maxpages <pages>] seed_url"

def parse_args():
    # code for parsing arguments (works fine, so not included here)
    pass

def crawl():
    global links, visited
    url = links.pop()    
    visited.append(url)

    print "\ncurrent url: %s" % url

    response = urllib.urlopen(url)
    html = response.read()

    html = html.lower()

    raw_links = re.findall(r'<a href="[\w\.-]+"', html)

    print "found: %d" % len(raw_links)

    for raw_link in raw_links:
        temp = raw_link.split('"')
        if temp[LINK_INDEX] not in visited and temp[LINK_INDEX] not in links:
            links.append(temp[LINK_INDEX])

    print "\nunvisited:"
    for link in links:
        print link

    print "\nvisited:"
    for link in visited:
        print link

parse_args()

while len(visited) < maxpages and len(links) > 0:
    crawl()
    time.sleep(politeness)

print "politeness = %d, maxpages = %d" % (politeness, maxpages)

I created a small test network in the same working directory of about 10 pages that all link together in various ways, and it seems to work fine, but when I send it out onto the actual internet by itself, it is unable to parse links from files it gets.

It is able to get the html code fine, because I can print that out, but it seems that the re.findall() part is not doing what it is supposed to, because the links list never gets populated. Have I maybe written my regex wrong? It worked fine to find strings like <a href="test02.html" and then parse the link from that, but for some reason, it isn't working for actual web pages. It might be the http part perhaps that is throwing it off?

I've never used regex with Python before, so I'm pretty sure that this is the problem. Can anyone give me an idea of how to express the pattern I am looking for better? Thanks!


Solution

The problem is with your regex. There are many ways to write a valid HTML anchor that it won't match: there could be extra whitespace or line breaks in the tag, or other attributes before href. The pattern itself is also case-sensitive. Your code works around that by lower-casing the whole page, but that lower-cases the URLs too, and URL paths can be case-sensitive. For example:

<a  href="foo">foo</a>

<A HREF="foo">foo</a>

<a class="bar" href="foo">foo</a>

None of these would be matched by your regex.
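To see this concretely, here is a small sketch (written for Python 3, unlike the Python 2 code in the question) that feeds a few sample anchors straight to the original pattern, without the lower-casing step. The sample strings and the helper name matches_original are made up for illustration.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = r'<a href="[\w\.-]+"'

def matches_original(html):
    """Return True if the question's pattern finds a link in html."""
    return re.search(ORIGINAL, html) is not None

samples = [
    '<a href="test02.html">ok</a>',             # the local test case
    '<a  href="foo">foo</a>',                   # extra whitespace
    '<A HREF="foo">foo</a>',                    # upper case
    '<a class="bar" href="foo">foo</a>',        # another attribute first
    '<a href="http://example.com/page">x</a>',  # ':' and '/' are not in [\w\.-]
]

for sample in samples:
    print(matches_original(sample), sample)
```

Only the first sample matches. The last one shows why real pages fail: an absolute URL contains characters the character class does not allow.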

You probably want something more like this:

<a[^>]*href="(.*?)"

This will match an anchor tag start, followed by any characters other than > (so that we're still matching inside the tag). This might be things like a class or id attribute. The value of the href attribute is then captured in a capture group, which you can extract by

match.group(1)

The match for the href value is also non-greedy, meaning it stops at the first closing quote it finds. A greedy match would run on to the last quote on the line, swallowing any other attributes or tags in between.

Finally, pass the re.I flag so that the match is case-insensitive.
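Putting those pieces together might look like the sketch below (Python 3). The function name extract_links and the sample page are assumptions for illustration, not part of the original crawler.

```python
import re

# Suggested pattern: anchor start, anything but '>', then the href value
# captured non-greedily. re.I makes <A HREF=...> match as well.
ANCHOR_HREF = re.compile(r'<a[^>]*href="(.*?)"', re.I)

def extract_links(html):
    """Return the captured href value of every anchor the pattern finds."""
    return [match.group(1) for match in ANCHOR_HREF.finditer(html)]

page = '''
<a  href="foo.html">foo</a>
<A HREF="http://example.com/Some/Path">bar</A>
<a class="nav" href="/baz?x=1">baz</a> <a href="qux">qux</a>
'''

print(extract_links(page))
```

Because the flag handles case, the page no longer has to be lower-cased first, so the extracted URLs keep their original capitalization.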

OTHER TIPS

Your regexp doesn't match all valid values of the href attribute, such as paths with slashes or full URLs starting with http:// (neither / nor : is in your character class). Using [^"]+ (anything other than the closing double quote) instead of [\w\.-]+ would help, but it doesn't matter because… you should not parse HTML with regexps to begin with.

Lev already mentioned BeautifulSoup; you could also look at lxml. Either will work better than any hand-crafted regexp you could write.
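BeautifulSoup and lxml are third-party packages, so as a stand-in that runs without installing anything, here is a rough sketch of the same idea using the standard library's html.parser module (Python 3 naming). It is a different tool from the two named above, and the LinkCollector class and extract_links function are hypothetical names.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names but leaves
        # attribute values (the URLs) untouched.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links

page = '''
<A HREF='single.html'>single quotes</A>
<a class="nav"
   href="http://example.com/a/b">split across lines</a>
<!-- <a href="commented-out.html">hidden</a> -->
'''

print(extract_links(page))
```

A real parser copes with single quotes, attributes in any order, and tags split across lines, and it does not report the anchor inside the HTML comment.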

You probably want this:

raw_links = re.findall(r'<a href="(.+?)"', html)

Use the parentheses to mark what you want returned; without them, findall gives you the whole match, including the <a href=... part. The non-greedy +? operator means you get everything up to the first closing quote mark.
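A quick way to see the difference is to run findall with and without the group, and with a greedy quantifier, on the same string (Python 3; the sample HTML is made up):

```python
import re

html = '<a href="http://example.com/a">a</a> <a href="b.html">b</a>'

# No group: findall returns the whole matched text.
whole = re.findall(r'<a href=".+?"', html)

# One group, non-greedy: findall returns only the captured URL.
captured = re.findall(r'<a href="(.+?)"', html)

# One group, greedy: the match runs on to the last quote on the line.
greedy = re.findall(r'<a href="(.+)"', html)

print(whole)
print(captured)
print(greedy)
```

The greedy version returns a single item that spans both anchors, which is why the non-greedy form is the one to use here.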

A more discriminating filter might be:

raw_links = re.findall(r'<a href="([^">]+?)"', html)

This matches any run of characters that contains neither a double quote nor a closing angle bracket.

These simple REs will also match URLs that have been commented out, URL-like literal strings inside bits of JavaScript, and so on. So be careful about using the results!
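As a small illustration of that caveat (Python 3, with an invented sample page), the stricter pattern still picks up links from a comment and from a script string:

```python
import re

PATTERN = re.compile(r'<a href="([^">]+?)"')

page = '''
<a href="real.html">real</a>
<!-- <a href="old.html">removed</a> -->
<script>var s = '<a href="from-script.html">x</a>';</script>
'''

found = PATTERN.findall(page)
print(found)
```

All three URLs come back, although only the first is a link a visitor could actually click.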

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow