Question

I've got two functions that work just fine, but seem to break down when I run them nested together.

def scrape_all_pages(alphabet):
    pages = get_all_urls(alphabet)
    for page in pages:
        scrape_table(page)

I'm trying to systematically scrape some search results. get_all_urls() creates a list of URLs for each letter in the alphabet. Sometimes there are thousands of pages, but that works just fine. Then, for each page, scrape_table() scrapes just the table I'm interested in. That also works fine. I can run the whole thing and it works, but I'm working in ScraperWiki, and if I set it to run and walk away it invariably gives me a "list index out of range" error. This is definitely an issue within ScraperWiki, but I'd like to zero in on the problem by adding some try/except clauses and logging errors when I encounter them. Something like:

def scrape_all_pages(alphabet):
    try:
        pages = get_all_urls(alphabet)
    except:
        pass  ## LOG THE ERROR IF THAT FAILS
    try:
        for page in pages:
            scrape_table(page)
    except:
        pass  ## LOG THE ERROR IF THAT FAILS

I haven't been able to figure out how to generically log errors, though. Also, the above looks clunky and in my experience when something looks clunky, Python has a better way. Is there a better way?


Solution 3

Your approach is a good one, but you should not use a bare except clause; specify the type of exception you are trying to catch. You can also catch the error inside the loop and continue with the next iteration:

def scrape_all_pages(alphabet):
    try:
        pages = get_all_urls(alphabet)
    except IndexError: # IndexError is an example
        ## LOG THE ERROR IF THAT FAILS, then bail out,
        ## since pages is undefined and there is nothing to loop over
        return

    for page in pages:
        try:
            scrape_table(page)
        except IndexError: # IndexError is an example
            pass ## LOG THE ERROR IF THAT FAILS and continue this loop
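
The placeholder comments still need real logging. Here is a minimal sketch of one way to fill them in with Python's standard logging module; the log file name is illustrative, and get_all_urls and scrape_table are the functions from the question:

import logging

logging.basicConfig(filename="scrape_errors.log",
                    level=logging.ERROR,
                    format="%(asctime)s %(levelname)s %(message)s")

def scrape_all_pages(alphabet):
    try:
        pages = get_all_urls(alphabet)
    except IndexError:
        # logging.exception() logs at ERROR level and appends the traceback
        logging.exception("get_all_urls failed for %r", alphabet)
        return

    for page in pages:
        try:
            scrape_table(page)
        except IndexError:
            # log the failure and move on to the next page
            logging.exception("scrape_table failed for %s", page)

logging.exception() is simply logging.error() with the current traceback attached, which covers the "generically log errors" part of the question.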

OTHER TIPS

You can specify a certain type of exception to catch and a variable to hold the exception instance:

def scrape_all_pages(alphabet):
    try:
        pages = get_all_urls(alphabet)
        for page in pages:
            scrape_table(page)
    except IndexError as error:
        # Will only catch IndexError
        print error
    except Exception as error:
        # Will catch any other exception that inherits from Exception
        print error

Catching the type Exception will catch almost all errors, since nearly every built-in exception inherits from it; KeyboardInterrupt, SystemExit, and GeneratorExit are the exceptions, inheriting directly from BaseException.
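
That hierarchy is easy to check in the interpreter:

>>> issubclass(IndexError, Exception)
True
>>> issubclass(KeyboardInterrupt, Exception)
False
>>> issubclass(KeyboardInterrupt, BaseException)
True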

This is the only way I know of for catching errors.

Wrap the logging logic in a context manager, like this (you can easily change the details to meet your requirements):

import traceback

# This is a context manager
class LogError(object):
    def __init__(self, logfile, message):
        self.logfile = logfile
        self.message = message
    def __enter__(self):
        return self
    def __exit__(self, type, value, tb):
        if type is None or not issubclass(type, Exception):
            # Allow KeyboardInterrupt and other non-Exception errors to pass through
            return

        self.logfile.write("%s: %r\n" % (self.message, value))
        traceback.print_exception(type, value, tb, file=self.logfile)
        return True # "swallow" the exception

# This is a helper class to maintain an open file object and
# a way to provide extra information to the context manager.
class ExceptionLogger(object):
    def __init__(self, filename):
        self.logfile = open(filename, "a")  # append, so earlier log entries are kept
    def __call__(self, message):
        # make the instance callable, so each use can carry its own message
        return LogError(self.logfile, message)

The key part is that __exit__ can return True, in which case the exception is swallowed and the program carries on. The code also needs to be a bit careful, since a KeyboardInterrupt (control-C), SystemExit, or other exception that does not inherit from Exception might be raised, and in those cases you actually do want the program to stop.
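
For comparison, the same idea can be written more compactly as a generator with contextlib.contextmanager; this is a sketch of the equivalent behaviour, with illustrative names:

import sys
import traceback
from contextlib import contextmanager

@contextmanager
def log_error(message, logfile=sys.stderr):
    try:
        yield
    except Exception as error:
        # anything not inheriting from Exception, such as KeyboardInterrupt,
        # propagates normally; everything else is logged and swallowed
        logfile.write("%s: %r\n" % (message, error))
        traceback.print_exc(file=logfile)

Because the generator catches the exception and returns without re-raising it, the with block suppresses it, just like returning True from __exit__.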

You can use the above in your code like this:

elog = ExceptionLogger("/dev/tty")

with elog("Can I divide by 0?"):
    1/0

for i in range(-4, 4):
    with elog("Divisor is %d" % (i,)):
        print "5/%d = %d" % (i, 5/i)

That snippet gives me the output:

Can I divide by 0?: ZeroDivisionError('integer division or modulo by zero',)
Traceback (most recent call last):
  File "exception_logger.py", line 24, in <module>
    1/0
ZeroDivisionError: integer division or modulo by zero
5/-4 = -2
5/-3 = -2
5/-2 = -3
5/-1 = -5
Divisor is 0: ZeroDivisionError('integer division or modulo by zero',)
Traceback (most recent call last):
  File "exception_logger.py", line 28, in <module>
    print "5/%d = %d" % (i, 5/i)
ZeroDivisionError: integer division or modulo by zero
5/1 = 5
5/2 = 2
5/3 = 1

I think it's also easy to see how one might modify the code to handle logging only IndexError exceptions, or even to pass in the base exception type to catch.
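
For instance, a variant that takes the exception type to catch as a parameter might look like this; LogOnly and catch_type are illustrative names, not part of the code above:

import sys
import traceback

class LogOnly(object):
    def __init__(self, logfile, message, catch_type=Exception):
        self.logfile = logfile
        self.message = message
        self.catch_type = catch_type
    def __enter__(self):
        return self
    def __exit__(self, type, value, tb):
        if type is None or not issubclass(type, self.catch_type):
            return  # propagate anything we were not asked to catch
        self.logfile.write("%s: %r\n" % (self.message, value))
        traceback.print_exception(type, value, tb, file=self.logfile)
        return True  # swallow only the requested type

# Only IndexError is logged and swallowed; anything else propagates:
with LogOnly(sys.stderr, "indexing past the end", IndexError):
    [1, 2, 3][10]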

Log the error inside each iteration, so that an error in one iteration doesn't break your loop:

for page in pages:
    try:
        scrape_table(page)
    except Exception as error:
        # open the error log for append, write a message specific
        # to this iteration (page), and close the file again
        with open("errors.txt", "a") as f:
            f.write("Error scraping %s: %r\n" % (page, error))

It's better to write it like this:

try:
    pages = get_all_urls(alphabet)
except IndexError:
    ## LOG THE ERROR IF THAT FAILS
    pages = []  ## give the loop below something to iterate over
for page in pages:
    try:
        scrape_table(page)
    except IndexError:
        ## LOG THE ERROR, then move on to the next item in the for loop
        continue
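
If you want to revisit the failures afterwards instead of just skipping them, a small variation on that loop (a sketch, not from the answers above) collects the pages that raised:

failed = []
for page in pages:
    try:
        scrape_table(page)
    except IndexError as error:
        # keep the failure around for a later retry or report
        failed.append((page, error))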
Licensed under: CC-BY-SA with attribution