Your issue is that you are passing in the last modified date in place of the etag. The etag is the second argument to the parse() method; modified is the third argument.
Instead of:
d2=feedparser.parse(feed,modified)
Do:
d2=feedparser.parse(feed,modified=modified)
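To see why the positional call goes wrong, here is a minimal sketch using a stand-in function with the same parameter order as described above. Note that fake_parse is hypothetical and only exists for illustration; it is not part of feedparser and does no network access.

```python
def fake_parse(url, etag=None, modified=None):
    """Hypothetical stand-in that mirrors the parameter order described above."""
    return {"url": url, "etag": etag, "modified": modified}


last_modified = "Wed, 07 Nov 2012 19:27:48 GMT"

# Positional: the date silently lands in the etag slot.
positional = fake_parse("http://mrjakeparker.com/feed/", last_modified)

# Keyword: the date lands in the modified slot, where it belongs.
keyword = fake_parse("http://mrjakeparker.com/feed/", modified=last_modified)

print("positional -> etag=%r modified=%r" % (positional["etag"], positional["modified"]))
print("keyword    -> etag=%r modified=%r" % (keyword["etag"], keyword["modified"]))
```

Passing the value by keyword avoids depending on the argument order at all.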
After taking a look at the source code, it looks like the only thing that passing etag or modified to the parse() function does is send the appropriate request headers (If-None-Match and If-Modified-Since) to the server, so that the server can return an empty response if nothing has changed. If the server does not support this, it will just return the full RSS feed. I would modify your code to check the date of each entry and ignore any entry whose date is not later than the max date from the previous request:
import feedparser

rsslist = ["http://skottieyoung.tumblr.com/rss", "http://mrjakeparker.com/feed/"]


def feed_modified_date(feed):
    # This is the Last-Modified value from the response header.
    # Do not confuse it with the dates inside the feed entries, as the server
    # may be using a different timezone for the Last-Modified header than it
    # uses for the publish dates.
    return feed.get('modified')


def max_entry_date(feed):
    # Latest published_parsed among the entries, or None if no entry has one.
    entry_pub_dates = (e.get('published_parsed') for e in feed.entries)
    entry_pub_dates = tuple(e for e in entry_pub_dates if e is not None)
    if len(entry_pub_dates) > 0:
        return max(entry_pub_dates)
    return None


def entries_with_dates_after(feed, date):
    # With no previous date to compare against, treat every entry as new.
    if date is None:
        return list(feed.entries)
    response = []
    for entry in feed.entries:
        published = entry.get('published_parsed')
        # Skip entries without a parsed date instead of comparing None.
        if published is not None and published > date:
            response.append(entry)
    return response


for feed_url in rsslist:
    print('--------%s-------' % feed_url)
    d = feedparser.parse(feed_url)
    print('feed length %i' % len(d.entries))
    if len(d.entries) > 0:
        # feedparser exposes the ETag on the result itself (d.etag), not on d.feed
        etag = d.get('etag', None)
        modified = feed_modified_date(d)
        print('modified at %s' % modified)
        d2 = feedparser.parse(feed_url, etag=etag, modified=modified)
        print('second feed length %i' % len(d2.entries))
        if len(d2.entries) > 0:
            # perhaps the server does not support etags or last-modified,
            # so filter the entries ourselves
            print("server does not support etags or there are new entries")
            prev_max_date = max_entry_date(d)
            entries = entries_with_dates_after(d2, prev_max_date)
            print('%i new entries' % len(entries))
        else:
            print('there are no entries')
This produces:
--------http://skottieyoung.tumblr.com/rss-------
feed length 20
modified at None
second feed length 20
server does not support etags or there are new entries
0 new entries
--------http://mrjakeparker.com/feed/-------
feed length 10
modified at Wed, 07 Nov 2012 19:27:48 GMT
second feed length 0
there are no entries
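Rather than inferring "nothing changed" from an empty entry list, you can also look at the HTTP status code. As far as I know, feedparser records it under status on the result when the feed was fetched over HTTP, and a server that honours the conditional headers answers 304. The sketch below is only an illustration: not_modified is a hypothetical helper, and the plain dicts stand in for real parse results so it runs without any network access.

```python
def not_modified(result):
    """Return True when a parse result reports HTTP 304 (Not Modified)."""
    # 'status' is missing for local files or strings, so use .get()
    # rather than attribute access.
    return result.get("status") == 304


# Plain dicts standing in for feedparser results (assumed shapes).
unchanged = {"status": 304, "entries": []}
full_feed = {"status": 200, "entries": ["some entry"]}
local_file = {"entries": ["some entry"]}

for name, result in (("unchanged", unchanged),
                     ("full_feed", full_feed),
                     ("local_file", local_file)):
    print("%s: not modified = %s" % (name, not_modified(result)))
```

In the loop above you would call not_modified(d2) right after the second parse() and only fall back to the date filtering when it returns False.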