I'm a novice Python programmer trying to use Python to scrape a large number of pages from fanfiction.net and write a particular line of each page's HTML source to a .csv file. My program works fine at first, but it eventually hits a snag and stops running. My IDE reports that the program has encountered "Errno 10054: an existing connection was forcibly closed by the remote host".
I'm looking for a way to get my code to reconnect and continue every time I get the error. My code will be scraping a few hundred thousand pages every time it runs; is this maybe just too much for the site? The site doesn't appear to prevent scraping. I've done a fair amount of research on this problem already and attempted to implement a retry decorator, but the decorator doesn't seem to work. Here's the relevant section of my code:

    import time
    import urllib.error
    import urllib.request
    from functools import wraps

    def retry(ExceptionToCheck, tries=4, delay=3, backoff=2, logger=None):
        def deco_retry(f):
            @wraps(f)
            def f_retry(*args, **kwargs):
                mtries, mdelay = tries, delay
                while mtries > 1:
                    try:
                        return f(*args, **kwargs)
                    except ExceptionToCheck as e:
                        msg = "%s, Retrying in %d seconds..." % (str(e), mdelay)
                        if logger:
                            logger.warning(msg)
                        else:
                            print(msg)
                        time.sleep(mdelay)
                        mtries -= 1
                        mdelay *= backoff
                # last attempt: any exception raised here propagates to the caller
                return f(*args, **kwargs)
            return f_retry  # true decorator
        return deco_retry

    @retry(urllib.error.URLError, tries=4, delay=3, backoff=2)
    def retrieveURL(URL):
        response = urllib.request.urlopen(URL)
        return response

    def main():
        # first check: 5000 to 100,000
        MAX_ID = 600000
        ID = 400001
        URL = "http://www.fanfiction.net/s/" + str(ID) + "/index.html"
        fCSV = open('buffyData400k600k.csv', 'w')
        fCSV.write("Rating, Language, Genre 1, Genre 2, Character A, Character B, Character C, Character D, Chapters, Words, Reviews, Favorites, Follows, Updated, Published, Story ID, Story Status, Author ID, Author Name" + '\n')
        while ID <= MAX_ID:
            URL = "http://www.fanfiction.net/s/" + str(ID) + "/index.html"
            response = retrieveURL(URL)

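To rule out the decorator's own wiring, I think a check along these lines could help. It is only a sketch: `flaky` and `calls` are made-up names for illustration, `flaky` stands in for `retrieveURL` by failing twice before succeeding, and I am assuming the error surfaces as `ConnectionResetError`, which I haven't confirmed. The delay is set to 0 so the check finishes immediately.

```python
import time
from functools import wraps


def retry(ExceptionToCheck, tries=4, delay=3, backoff=2):
    """Same retry logic as in the question, repeated so this runs on its own."""
    def deco_retry(f):
        @wraps(f)
        def f_retry(*args, **kwargs):
            mtries, mdelay = tries, delay
            while mtries > 1:
                try:
                    return f(*args, **kwargs)
                except ExceptionToCheck as e:
                    print("%s, Retrying in %d seconds..." % (str(e), mdelay))
                    time.sleep(mdelay)
                    mtries -= 1
                    mdelay *= backoff
            # last attempt: any exception raised here propagates
            return f(*args, **kwargs)
        return f_retry
    return deco_retry


calls = {"count": 0}  # hypothetical counter, for illustration only


@retry(ConnectionResetError, tries=4, delay=0, backoff=2)
def flaky():
    """Hypothetical stand-in for retrieveURL: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        # assumption: Errno 10054 shows up as ConnectionResetError
        raise ConnectionResetError(10054, "connection forcibly closed")
    return "ok"


result = flaky()
print(result, calls["count"])
```

If this retries as expected, the decorator logic itself is probably fine, and the question becomes whether the exception I am actually getting matches the type passed to `@retry`.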
Whenever I run the .py file outside of my IDE, it eventually locks up and stops grabbing new pages after about an hour, tops. I'm also running a different version of the same file in my IDE, and that appears to have been running for almost 12 hours now, if not longer. Is it possible that the file could work in my IDE but not when run independently?
Have I set my decorator up wrong? What else could I do to get Python to reconnect? I've also seen claims that an out-of-date SQL Native Client could cause problems for a Windows user such as myself. Is this true? I've tried to update it but had no luck.
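One thing I could check is whether the exception behind Errno 10054 would even be caught by a decorator that only watches for `urllib.error.URLError`. The sketch below assumes the error surfaces as the built-in `ConnectionResetError` (an assumption on my part, since I only have the IDE's message); `caught_by` is a made-up helper name.

```python
import urllib.error


def caught_by(exc_type, handler_type):
    """Return True if `except handler_type` would catch an exc_type instance."""
    return issubclass(exc_type, handler_type)


# Assumption: Errno 10054 is raised as ConnectionResetError.
reset_is_urlerror = caught_by(ConnectionResetError, urllib.error.URLError)
reset_is_oserror = caught_by(ConnectionResetError, OSError)
urlerror_is_oserror = caught_by(urllib.error.URLError, OSError)

print(reset_is_urlerror, reset_is_oserror, urlerror_is_oserror)
```

If the first value is `False` while the other two are `True`, then a bare `ConnectionResetError` would slip past `@retry(urllib.error.URLError, ...)`, whereas retrying on `OSError` would cover both types.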
Thank you!