Using re.findall() in Python for Web Crawling

Question 1

The problem is with your regex. There are many ways to write a valid HTML anchor that your regex wouldn't match: there could be extra whitespace or line breaks inside the tag, there could be other attributes that you haven't taken into account, and the tag or attribute names could be in a different case. For example:

<a  href="foo">foo</a>

<A HREF="foo">foo</a>

<a class="bar" href="foo">foo</a>

None of these would be matched by your regex.

You probably want something more like this:

<a[^>]*href="(.*?)"

This matches the start of an anchor tag, followed by any characters other than > (so that we stay inside the tag), which covers things like a class or id attribute. The value of the href attribute is then captured in a group, which you can extract from a match object with

match.group(1)

The match for the href value is also non-greedy, meaning it matches as little as possible. Otherwise, if there are other tags on the same line, the match would run past the closing quote you want. Note that re.findall() returns the captured group values directly, so no match.group() call is needed there.

Finally, you'll need to add the re.I flag to match case-insensitively.
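
Putting that together, here is a minimal sketch of the pattern used with re.findall(). The sample markup is made up for illustration (the three anchors from above, given distinct href values so the results are easy to tell apart), and the variable names are my own.

```python
import re

# Sample markup: the three problem anchors from above, with distinct
# href values (an assumption, purely so the output is easy to read).
html = '''
<a  href="foo">foo</a>
<A HREF="bar">bar</a>
<a class="bar" href="baz">baz</a>
'''

# re.I makes both the tag name and the attribute name case-insensitive.
pattern = re.compile(r'<a[^>]*href="(.*?)"', re.I)

# findall() returns the contents of the single capture group per match.
links = pattern.findall(html)
print(links)

# With a match object, the same value comes from group(1).
first = pattern.search(html).group(1)
print(first)
```

All three anchors are picked up here, including the uppercase one and the one with an extra attribute.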

Question 2

Your regexp doesn't match all valid values for the href attribute, such as paths with slashes. Using [^"]+ (anything other than the closing double quote) instead of [\w\.-]+ would help, but it doesn't matter because… you should not parse HTML with regexps to begin with.

Lev already mentioned BeautifulSoup; you could also look at lxml. Either will work better than any hand-crafted regexp you could write.
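
As a rough illustration of the parser-based approach, here is a sketch using html.parser from the standard library. Note this is a stand-in chosen so the example runs without installing anything; it is not BeautifulSoup or lxml, which offer friendlier APIs. The LinkCollector class name and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names already lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up markup: uppercase tag, a tag split across lines, and a
# commented-out link that should not be reported.
html = '''<A HREF="/docs/index.html">docs</A>
<a class="bar"
   href="foo">foo</a>
<!-- <a href="old">old</a> -->'''

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

The parser deals with case, extra attributes, and line breaks on its own, and the link inside the comment is skipped because comments are handled separately from tags.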

Question 3

You probably want this:

raw_links = re.findall(r'<a href="(.+?)"', html)

Use parentheses (a capture group) to indicate what you want returned; otherwise you get the whole match, including the <a href=... bit. Now you get everything up to the closing quote mark, thanks to the non-greedy +? operator.
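
To make the difference concrete, here is a small sketch comparing the pattern without a group, with a group, and with a greedy quantifier. The html string and the example.com URLs are invented for the demonstration.

```python
import re

# Made-up input with two links on the same line.
html = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'

# No capture group: findall() returns the whole matched text.
whole = re.findall(r'<a href=".+?"', html)

# One capture group: findall() returns only what the group matched.
raw_links = re.findall(r'<a href="(.+?)"', html)

# Greedy variant for comparison: .+ runs on to the last quote in the line,
# so both links end up inside a single, useless match.
greedy = re.findall(r'<a href="(.+)"', html)

print(whole)
print(raw_links)
print(greedy)
```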

A more discriminating filter might be:

raw_links = re.findall(r'<a href="([^">]+?)"', html)

This matches anything except a quote or a closing angle bracket.

These simple REs will also match URLs that have been commented out, URL-like literal strings inside bits of JavaScript, etc. So be careful about using the results!
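
A quick sketch of that pitfall, using invented markup: the regex has no idea what is a comment or a script, so it reports all three links below even though only the first one is a real, live anchor.

```python
import re

# Made-up page fragment: one live link, one commented-out link,
# and one anchor-like string inside a script.
html = '''
<a href="/live">live</a>
<!-- <a href="/commented-out">old</a> -->
<script>var s = '<a href="/from-script">x</a>';</script>
'''

raw_links = re.findall(r'<a href="([^">]+?)"', html)
print(raw_links)
```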