Question

I am writing a crawler which regularly inspects a list of news websites for new articles. I have read about different approaches for avoiding unnecessary page downloads, and have identified five header elements that could be useful for determining whether a page has changed (a quick sketch of reading these headers follows the list):

  1. HTTP status code
  2. ETag
  3. Last-Modified (combined with an If-Modified-Since request header)
  4. Expires
  5. Content-Length
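
For reference, a quick sketch (using urllib2 and a placeholder URL) of how these headers can be read from a response before committing to a full download:

import urllib2

# urlopen() parses the response headers immediately, but the body is
# not read until .read() is called.
resp = urllib2.urlopen("http://example.com/news")  # placeholder URL
print "status =", resp.getcode()
for name in ("ETag", "Last-Modified", "Expires", "Content-Length"):
    print name, "=", resp.headers.get(name)
resp.close()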

The excellent FeedParser.org seems to implement some of these approaches.

I am looking for optimal code in Python (or any similar language) that makes this kind of decision. Keep in mind that header info is not always provided by the server.

That could be something like:

def shouldDownload(url, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    # retrieve the headers, do the magic here and return the decision
    return decision

Solution

The only thing you need to check before making the request is Expires. If-Modified-Since is not something the server sends you, but something you send the server.

What you want to do is an HTTP GET with an If-Modified-Since header indicating when you last retrieved the resource. If you get back status code 304 rather than the usual 200, the resource has not been modified since then, and you should use your stored copy (a new copy will not be sent).

Additionally, you should retain the Expires header from the last time you retrieved the document, and not issue the GET at all if your stored copy of the document has not expired.

Translating this into Python is left as an exercise, but it should be straightforward to add an If-Modified-Since header to a request, to store the Expires header from the response, and to check the status code from the response.
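
A minimal sketch of that logic with urllib2 (the parameter names echo the question's skeleton and are otherwise hypothetical; note that urllib2 surfaces a 304 response as an HTTPError):

import time
import urllib2
from email.utils import mktime_tz, parsedate_tz

def fetch_if_changed(url, prev_lastmod, prev_expires):
    # Skip the request entirely if the stored copy has not expired yet.
    if prev_expires:
        expires = parsedate_tz(prev_expires)
        if expires and mktime_tz(expires) > time.time():
            return None  # stored copy is still fresh

    # Otherwise issue a conditional GET.
    request = urllib2.Request(url)
    if prev_lastmod:
        request.add_header("If-Modified-Since", prev_lastmod)
    try:
        return urllib2.urlopen(request)  # 200: changed, body available
    except urllib2.HTTPError as e:
        if e.code == 304:
            return None  # not modified; keep using the stored copy
        raise

On a fresh copy, the caller would read the body and store the new Last-Modified and Expires headers for the next pass.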

OTHER TIPS

You would need to pass a dict of headers into shouldDownload (for example, the .headers of a urlopen result):

def shouldDownload(url, headers, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    return (prev_content_length != headers.get("Content-Length")
            or prev_lastmod != headers.get("Last-Modified")
            or prev_expires != headers.get("Expires")
            or prev_etag != headers.get("ETag"))
    # or the optimistic way:
    # return (prev_content_length == headers.get("Content-Length")
    #         and prev_lastmod == headers.get("Last-Modified")
    #         and prev_expires == headers.get("Expires")
    #         and prev_etag == headers.get("ETag"))

Do that when you open the URL:

import urllib2

# urlopen() sends the request and parses the response headers, but it
# doesn't read the body until .read() is called, so the headers are
# available via `.headers` before committing to the full download.
s = urllib2.urlopen(MYURL)
try:
    if shouldDownload(MYURL, s.headers, prev_etag, prev_lastmod,
                      prev_expires, prev_content_length):
        source = s.read()
        # do stuff with source
    else:
        pass  # unchanged; `continue` here if this runs inside a crawl loop
    # add except urllib2.HTTPError etc. if you need error handling
finally:
    s.close()
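
On Python 3, the same pattern works with urllib.request (urllib2's successor); a sketch under the same assumptions:

import urllib.request

with urllib.request.urlopen(MYURL) as s:
    # headers are parsed before the body is read
    if shouldDownload(MYURL, s.headers, prev_etag, prev_lastmod,
                      prev_expires, prev_content_length):
        source = s.read()
        # do stuff with source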
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow