Question

I was wondering how Google Reader extracts news items from a web page.

Does anyone know how it works, or how someone could build a similar system to extract the same information from the HTML of a web page?

Obviously it does not rely on a standard format (nor does it read only RSS/Atom), because Google Reader shows that it can read a page's content regardless of what the markup looks like.


Solution

Google Reader does not currently do any kind of extraction of content from raw web pages. It used to have a "track changes to arbitrary pages" feature, but that was removed more than a year ago.

When given a URL that is not that of a feed, Google Reader fetches its contents. If the contents are HTML, it looks for an autodiscovery element of the form <link rel="alternate" type="application/atom+xml" href="feed.xml">. If one is found, it subscribes to that feed.
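
The autodiscovery step described above can be sketched roughly as follows. This is only an illustration using Python's standard library, not Google Reader's actual code: the names FeedLinkFinder and discover_feeds are hypothetical, and accepting application/rss+xml alongside the Atom type is an assumption that goes beyond what the answer states.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: RSS links are accepted in addition to the Atom type
# mentioned in the answer.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects the href of every <link rel="alternate"> feed element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attributes without a value arrive as None; normalise to "".
        attr = {name: (value or "") for name, value in attrs}
        rel_tokens = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        href = attr.get("href", "").strip()
        if "alternate" in rel_tokens and mime in FEED_TYPES and href:
            # Resolve relative hrefs such as "feed.xml" against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return absolute feed URLs advertised in the given HTML text."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds
```

For a page at http://example.com/blog/ whose head contains the <link> element shown above, discover_feeds would return ["http://example.com/blog/feed.xml"]. A page with no such element yields an empty list, which matches the point of the answer: without an advertised feed there is nothing to subscribe to.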

Other answers

You already answered your own question by tagging it "RSS".

Anyway, Google Reader, like all other RSS/Atom readers, reads an RSS or Atom feed. You may want to have a look at the corresponding Wikipedia article: http://en.wikipedia.org/wiki/RSS

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow