Question

I'm a novice Python programmer trying to use Python to scrape a large number of pages from fanfiction.net and write a particular line of each page's HTML source into a .csv file. My program works fine, but eventually hits a snag where it stops running. My IDE told me that the program has encountered "Errno 10054: an existing connection was forcibly closed by the remote host".

I'm looking for a way to get my code to reconnect and continue every time I get the error. My code will be scraping a few hundred thousand pages every time it runs; is this maybe just too much for the site? The site doesn't appear to prevent scraping. I've done a fair amount of research on this problem already and attempted to implement a retry decorator, but the decorator doesn't seem to work. Here's the relevant section of my code:

import time
import urllib.request
import urllib.error
from functools import wraps


# Retry decorator: retries the wrapped call on ExceptionToCheck, with exponential backoff.
def retry(ExceptionToCheck, tries=4, delay=3, backoff=2, logger=None):

    def deco_retry(f):

        @wraps(f)
        def f_retry(*args, **kwargs):
            mtries, mdelay = tries, delay
            while mtries > 1:
                try:
                    return f(*args, **kwargs)
                except ExceptionToCheck as e:
                    msg = "%s, Retrying in %d seconds..." % (str(e), mdelay)
                    if logger:
                        logger.warning(msg)
                    else:
                        print(msg)
                    time.sleep(mdelay)
                    mtries -= 1
                    mdelay *= backoff
            return f(*args, **kwargs)

        return f_retry  # true decorator

    return deco_retry


@retry(urllib.error.URLError, tries=4, delay=3, backoff=2)
def retrieveURL(URL):
    response = urllib.request.urlopen(URL)
    return response


def main():
    # first check: 5000 to 100,000 
    MAX_ID = 600000
    ID = 400001
    URL = "http://www.fanfiction.net/s/" + str(ID) + "/index.html"
    fCSV = open('buffyData400k600k.csv', 'w')
    fCSV.write("Rating, Language, Genre 1, Genre 2, Character A, Character B, Character C, Character D, Chapters, Words, Reviews, Favorites, Follows, Updated, Published, Story ID, Story Status, Author ID, Author Name" + '\n')

    while ID <= MAX_ID:

        URL = "http://www.fanfiction.net/s/" + str(ID) + "/index.html"
        response = retrieveURL(URL)

Whenever I run the .py file outside of my IDE, it eventually locks up and stops grabbing new pages after about an hour, tops. I'm also running a different version of the same file in my IDE, and that one appears to have been running for almost 12 hours now, if not longer. Is it possible that the file could work in my IDE but not when run independently?

Have I set my decorator up wrong? What else could I potentially do to get Python to reconnect? I've also seen claims that an out-of-date SQL Native Client could cause problems for a Windows user such as myself - is this true? I've tried to update it but had no luck.

Thank you!


Solution

You are catching URLError, which Errno 10054 is not, so your @retry decorator is not going to retry. Try this:

@retry(Exception, tries=4)
def retrieveURL(URL):
    response = urllib.request.urlopen(URL)
    return response

This should retry up to 4 times on any Exception. Your @retry decorator itself is defined correctly.
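For a quick sanity check of that behavior, here is a minimal, hypothetical sketch; sometimes_fails and the simulated ConnectionResetError are stand-ins for illustration and not part of your original code:

import random

@retry(Exception, tries=4)
def sometimes_fails():
    # Simulate an intermittent connection reset (Errno 10054 surfaces as
    # ConnectionResetError on Python 3).
    if random.random() < 0.7:
        raise ConnectionResetError("simulated Errno 10054")
    return "ok"

# Prints the "Retrying in N seconds..." messages on failures, then "ok";
# if all 4 attempts fail, the last exception propagates.
print(sometimes_fails())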

OTHER TIPS

Your code for reconnecting looks good except for one part - the exception that you're trying to catch. According to this StackOverflow question, Errno 10054 is a socket.error, not a URLError. All you need to do is import socket and make your retry handler catch socket.error as well.
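If you'd rather not retry on every Exception, one sketch of a middle ground is to pass a tuple of exception types to the existing decorator (an except clause accepts a tuple). The tuple below is only an assumption about which errors your runs hit; adjust it to what you actually see:

import socket
import urllib.request
import urllib.error

# On Python 3, socket.error is an alias for OSError, which also covers
# ConnectionResetError (Errno 10054).
NETWORK_ERRORS = (urllib.error.URLError, socket.error)

@retry(NETWORK_ERRORS, tries=4, delay=3, backoff=2)
def retrieveURL(URL):
    return urllib.request.urlopen(URL)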

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow