Question

I've been working on a web crawler in Python using BeautifulSoup and ran into a couple of problems:

  1. I have no idea how to handle errors like 404s, 503s, or anything else like that; currently, any such error simply crashes the crawler

  2. I have no idea how to search for a specific string in a page, for example printing out the URLs of pages that contain the string "Python"

If anyone has any input on how I could accomplish either of those, or could point me in the right direction, it would be appreciated.

Currently my code is this:

    import urllib.request, time, unicodedata
    from bs4 import BeautifulSoup
    num = 0
    def index():
        index = open('index.html', 'w')
        for x in range(len(titles)-1):
                index.write("<a href="+'"'+tocrawl[x]+'"'+" "+"target=" "blank"" >"+titles[x+1]+"</a></br>\n")
        index.close()
        return 'Index Created'


    def crawl(args):
        page = urllib.request.urlopen(args).read()
        soup = BeautifulSoup(page)
        soup.prettify().encode('UTF-8')
        titles.append(str(soup.title.string.encode('utf-8'),encoding='utf-8'))
        for anchor in soup.findAll('a', href=True):
            if str(anchor['href']).startswith(https) or str(anchor['href']).startswith(http):
                if anchor['href'] not in tocrawl:
                    if anchor['href'].endswith(searchfor):
                            print(anchor['href'])
                    if not anchor['href'].endswith('.png') and not anchor['href'].endswith('.jpg'):
                        tocrawl.append(anchor['href'])

    tocrawl, titles, descriptions, scripts, results = [], [], [], [], []
    https = 'https://'
    http = 'http://'
    next = 3
    crawl('http://google.com/')
    while 1:
        crawl(tocrawl[num])
        num = num + 1
        if num==next:
            index()
            next = next + 3

I am using Python 3.2, in case it matters

Solution

Handling error codes:
When urllib.request.urlopen hits an error while opening a URL, it raises a urllib.error.HTTPError, which conveniently includes the HTTP status code and a reason (a short explanatory string). If you want to ignore errors, you can wrap the call in a try / except block and swallow the exception:

    import urllib.error

    try:
        page = urllib.request.urlopen(args).read()
        # ...
    except urllib.error.HTTPError as e:
        # we don't care about no stinking errors...
        # ... but if we did, e.code would hold the HTTP status code,
        # ... and e.reason would hold an explanation of the error (hopefully)
        pass
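
A minimal sketch of how this might fit into your crawl function, assuming you simply want to skip pages that fail to load. Note that urlopen can also raise urllib.error.URLError for network-level failures (unknown host, refused connection), which is worth catching as well:

    import urllib.request, urllib.error
    from bs4 import BeautifulSoup

    def crawl(args):
        try:
            page = urllib.request.urlopen(args).read()
        except urllib.error.HTTPError as e:
            # skip pages that respond with an HTTP error (404, 503, ...)
            print('skipping', args, '-', e.code, e.reason)
            return
        except urllib.error.URLError as e:
            # skip URLs that fail before an HTTP response arrives
            # (unknown host, refused connection, ...)
            print('skipping', args, '-', e.reason)
            return
        soup = BeautifulSoup(page)
        # ... continue processing the page as before ...

Since HTTPError is a subclass of URLError, the more specific except clause has to come first.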

Search a page for a string:
Beautiful Soup is incredibly powerful; its find method (and its find_all method) accepts a text keyword argument, which can be a string or a regular expression to match against the text of a page. In your case, since you just need to ensure that the text exists, you can simply check that find returns a result:

    import re

    if soup.find(text=re.compile('my search string')):
        # do something

More details on the text argument can be found in the documentation.
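
Applied to your crawler, a sketch along these lines could print the URL of each page whose text contains "Python" (page_contains is a hypothetical helper name, not part of Beautiful Soup; re.escape makes the search string match literally):

    import re

    def page_contains(soup, needle):
        # hypothetical helper: True if the page's text contains
        # the given string anywhere (matched literally)
        return soup.find(text=re.compile(re.escape(needle))) is not None

    # usage inside crawl(), after soup = BeautifulSoup(page):
    #     if page_contains(soup, 'Python'):
    #         print(args)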
