How to detect if a page is an RSS or ATOM feed

https://stackoverflow.com/questions/2442984

19-09-2019
|

Question

I'm currently building a new online Feed Reader in PHP. One of the features i'm working on is feed auto-discovery. If a user enters a website URL, the script will detect that its not a feed and look for the real feed URL by parsing the HTML for the proper tag.

The problem is, the way im currently detecting if the URL is a feed or a website only works part of the time, and I know it can't be the best solution. Right now im taking the CURL response and running it through simplexml_load_string, if it can't parse it I treat it as a website. Here is the code.

$xml = @simplexml_load_string( $site_found['content'] );

if( !$xml ) // this is a website, not a feed
{
    // handle website
}
else
{
    // parse feed
}

Obviously, this isn't ideal. Also, when it runs into an HTML website that it can parse, it thinks its a feed.

Any suggestions on a good way of detecting the difference between a feed or non-feed in PHP?

Thanks,

Pepper http://feedingo.com

Solution

I would sniff for the various unique identifiers those formats have:

Atom: Source

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

RSS 0.90: Source

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">

Netscape RSS 0.91

<rss version="0.91">

etc. etc. (See the 2nd source link for a full overview).

As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed> and <rss> tags, respectively. Plus you won't find those in a valid HTML document.

You could make an initial check to tell HTML and feeds apart by looking for <html> and <body> elements first. To avoid problems with invalid input, this may be a case where using regular expressions (over a parser) is finally justified for once :)

If it doesn't match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.

what that looks like in the wild - whether feed providers always conform to those rules - is a different question, but you should already be able to recognize a lot this way.

OTHER TIPS

I think your best choice is getting the Content-Type header as I assume that's the way firefox (or any other browser) does it. Besides, if you think about it, the Content-Type is indeed the way server tells user agents how to process the response content. Almost any decent HTTP server sends a correct Content-Type header.

Nevertheless you could try to identify rss/atom in the content as a second choice if the first one "fails"(this criteria is up to you).

An additional benefit is that you only need to request the header instead of the entire document, thus saving you bandwidth, time, etc. You can do this with curl like this:

<?php
 $ch = curl_init("http://sample.com/feed");
 curl_setopt($ch, CURLOPT_NOBODY, true); // this set the HTTP Request Method to HEAD instead GET(default) and the server only sends HTTP Header(no content).
 curl_exec($ch);
 $conType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

 if (is_rss($conType)){ // You need to implement is_rss($conType) function
    // TODO
 }elseif(is_html($conType)) { // You need to implement is_html($conType) function
    // Search a rss in html
 }else{
    // Error : Page has no rss/atom feed
 }
?>

Why not try to parse your data with a component built specifically to parse RSS/ATOM Feed, like Zend_Feed_Reader ?

With that, if the parsing succeeds, you'll be pretty sure that the URL you used is indeed a valid RSS/ATOM feed.

And I should add that you could use such a component to parse feed in order to extract their informations, too : no need to re-invent the wheel, parsing the XML "by hand", and dealing with special cases yourself.

Pepper,

Use the Content-Type HTTP response header to dispatch to the right handler.

Jan

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow