Question

I am using the feedparser library in Python to get various details from an RSS feed. Suppose I have pulled 25 headline titles from the RSS feed of a news channel. After an hour I run the feedparser command again to get the latest list of 25 headline titles. The list may or may not have changed by the second run.

Some of the headlines might be the same and some might be new. I need to be able to check whether any of the headlines differ from the ones pulled an hour earlier. Only the new headlines should be pushed into a database, to avoid dumping duplicates into it.

The code looks like this:

import feedparser

# Fetch and parse the feed
d = feedparser.parse('www.news.example.xml')
for item in d.entries:
    # hndlr is the database handler, defined elsewhere
    hndlr.write(item.title)  # every title is written, duplicates included

I need to be able to run the above code every hour and check whether any headline (title) has changed. If anything differs from the data extracted an hour earlier, only the new data should be dumped into the database.

Solution

Each feed item has an identifier, in item.id. Track those, together with their .updated (or .updated_parsed) entry, to check for new items.

So, check whether you have already seen the item (via item.id) or whether it has been updated since the last time you checked (via item.updated or item.updated_parsed).
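The check described above could be sketched roughly as follows. This is a minimal illustration, not feedparser functionality: the helper names (entry_key, entry_timestamp, new_or_updated) and the seen dictionary are hypothetical, and falling back to the link when an entry has no id is an assumption. It relies only on entries supporting dict-style .get(), which feedparser entries do.

```python
import calendar


def entry_key(entry):
    # Prefer the feed-supplied id. Falling back to the link is an
    # assumption for feeds that omit ids.
    return entry.get("id") or entry.get("link")


def entry_timestamp(entry):
    # updated_parsed is a UTC time tuple when the feed provides a date.
    parsed = entry.get("updated_parsed")
    return calendar.timegm(parsed) if parsed else None


def new_or_updated(entries, seen):
    """Return entries that are unseen or newer than the recorded timestamp.

    seen maps entry key -> last known timestamp and is updated in place.
    """
    fresh = []
    for entry in entries:
        key = entry_key(entry)
        if key is None:
            continue  # nothing to identify this entry by, so skip it
        stamp = entry_timestamp(entry)
        is_new = key not in seen
        is_updated = (
            not is_new
            and stamp is not None
            and (seen[key] is None or stamp > seen[key])
        )
        if is_new or is_updated:
            fresh.append(entry)
            seen[key] = stamp
    return fresh
```

On each hourly run you would pass d.entries and the same seen mapping (persisted between runs, for example in the database) and write only the returned entries.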

Do make sure you take advantage of feedparser's ETag support to check for changed feed contents, though. This only saves you from downloading feeds with no new items; you still need to detect which items have been added or updated when you get a fresh copy of the feed.
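A rough sketch of that conditional fetch is shown below. feedparser.parse accepts etag and modified arguments and reports the HTTP status on the result; the wrapper name fetch_if_changed, the state dictionary, and the injectable parse parameter (there only so the function can be exercised without network access) are assumptions made for illustration.

```python
def fetch_if_changed(url, state, parse=None):
    """Return the feed entries, or an empty list if the server says 304.

    state is a plain dict holding the etag/modified values from the
    previous run; it is updated in place.
    """
    if parse is None:
        import feedparser  # third-party; imported lazily
        parse = feedparser.parse

    d = parse(url, etag=state.get("etag"), modified=state.get("modified"))

    # 304 Not Modified: the feed has not changed since the last fetch.
    if d.get("status") == 304:
        return []

    # Remember the validators for the next run (either may be None if
    # the server does not send them).
    state["etag"] = d.get("etag")
    state["modified"] = d.get("modified")
    return d.get("entries", [])
```

The state dict needs to survive between hourly runs, so in practice it would be saved to disk or to the database alongside the headlines.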

Other tips

For "good" feeds you can use the ETag and Last-Modified / If-Modified-Since mechanism, which is described here: http://www.kbcafe.com/rss/rssfeedstate.html

But some servers don't support it, so you simply need to check post dates and ids and see whether you already have those posts in your DB.
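One way to do that check directly in the database is to make the entry id a primary key and let the database ignore duplicates. The sketch below uses SQLite purely as an example; the table name, columns, and the helper names open_store and store_new are hypothetical, and the question does not say which database is actually in use.

```python
import sqlite3


def open_store(path=":memory:"):
    # The id column is the primary key, so a repeated id cannot be
    # inserted twice.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS headlines ("
        "id TEXT PRIMARY KEY, title TEXT, updated TEXT)"
    )
    return conn


def store_new(conn, entries):
    """Insert entries whose id is not yet stored; return how many were added."""
    inserted = 0
    for entry in entries:
        # Falling back to the link when there is no id is an assumption.
        key = entry.get("id") or entry.get("link")
        if key is None:
            continue
        cur = conn.execute(
            "INSERT OR IGNORE INTO headlines (id, title, updated) "
            "VALUES (?, ?, ?)",
            (key, entry.get("title"), entry.get("updated")),
        )
        inserted += cur.rowcount  # 1 if inserted, 0 if ignored
    conn.commit()
    return inserted
```

Note that this variant only filters out items already stored; detecting an updated item would additionally require comparing the stored updated value.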

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow