PHP - how to get main HTML content like Reader Mode in Firefox

Question 1

Hooray!!!

I found this source code:

3) Create index.php with this code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
    <head>
        <title>!</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
<body dir="rtl">
<?php
include_once 'Readability.php';


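// URL of the page to extract the main content from
// (change this URL to whatever you'd like to test)
$url = 'http://';
$html = file_get_contents($url);
// file_get_contents() returns false when the request fails
if ($html === false) {
    exit('Could not fetch the page.');
}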

// Note: PHP Readability expects UTF-8 encoded content.
// If your content is not UTF-8 encoded, convert it 
// first before passing it to PHP Readability. 
// Both iconv() and mb_convert_encoding() can do this.

// If we've got Tidy, let's clean up input.
// This step is highly recommended - PHP's default HTML parser
// often doesn't do a great job and results in strange output.
if (function_exists('tidy_parse_string')) {
    $tidy = tidy_parse_string($html, array(), 'UTF8');
    $tidy->cleanRepair();
    $html = $tidy->value;
}

// give it to Readability
$readability = new Readability($html, $url);
// print debug output? 
// useful to compare against Arc90's original JS version - 
// simply click the bookmarklet with FireBug's console window open
$readability->debug = false;
// convert links to footnotes?
$readability->convertLinksToFootnotes = true;
// process it
$result = $readability->init();
// does it look like we found what we wanted?
if ($result) {
    echo "== Title =====================================\n";
    echo $readability->getTitle()->textContent, "\n\n";
    echo "== Body ======================================\n";
    $content = $readability->getContent()->innerHTML;
    // if we've got Tidy, let's clean it up for output
    if (function_exists('tidy_parse_string')) {
        $tidy = tidy_parse_string($content, array('indent'=>true, 'show-body-only' => true), 'UTF8');
        $tidy->cleanRepair();
        $content = $tidy->value;
    }
    echo $content;
} else {
    echo 'Looks like we couldn\'t find the content. :(';
}
?>
</body>
</html>

In the line $url = 'http://'; put the URL of the page you want to extract.
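The comments in the code above say PHP Readability expects UTF-8 and that mb_convert_encoding() can do the conversion. Here is a minimal sketch of that step, assuming the mbstring extension is loaded. The helper name toUtf8 and the fallback encoding ISO-8859-1 are my own assumptions and not part of PHP Readability; pick the fallback that matches the sites you fetch.

```php
<?php
// Hypothetical helper (not part of PHP Readability): make sure the
// fetched HTML is UTF-8 before passing it to Readability.
function toUtf8(string $html, string $assumedEncoding = 'ISO-8859-1'): string
{
    // Already valid UTF-8: return it untouched.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    // Otherwise treat the bytes as the assumed source encoding and convert.
    return mb_convert_encoding($html, 'UTF-8', $assumedEncoding);
}

// "caf\xE9" is the word "café" encoded as ISO-8859-1.
$html = toUtf8("<p>caf\xE9</p>");
echo $html, "\n";
```

You would call it right after fetching the page, before the Tidy step.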

Thank you;)

Question 2

A new PHP library named PHP Goose seems to do a very good job at this too. It's pretty easy to use and is Composer friendly.

Here's a usage example taken from the project's readme:

use Goose\Client as GooseClient;

$goose = new GooseClient();
$article = $goose->extractContent('http://url.to/article');

$title = $article->getTitle();
$metaDescription = $article->getMetaDescription();
$metaKeywords = $article->getMetaKeywords();
$canonicalLink = $article->getCanonicalLink();
$domain = $article->getDomain();
$tags = $article->getTags();
$links = $article->getLinks();
$movies = $article->getMovies();
$articleText = $article->getCleanedArticleText();
$entities = $article->getPopularWords();
$image = $article->getTopImage();
$allImages = $article->getAllImages();

Question 3

Readability.php works pretty well, but I've found you get better results if you fetch the HTML with cURL and spoof the user agent. You can also have cURL follow redirects in case the URL you are requesting gives you the runaround. Here is what I'm using now, slightly modified from another post (PHP Curl following redirects). Hope you find it useful.

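function getData($url) {
    // Undo HTML-escaped ampersands and URL-encoding in the incoming URL
    $url = str_replace('&amp;', '&', urldecode(trim($url)));
    $timeout = 5; // seconds
    $ch = curl_init();
    // Spoof a browser user agent; some sites treat non-browser clients differently
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1');
    curl_setopt($ch, CURLOPT_URL, $url);
    // An empty string enables in-memory cookie handling, so no temp file is left behind
    curl_setopt($ch, CURLOPT_COOKIEFILE, '');
    // Follow up to 10 redirects and set the Referer header automatically
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    // Accept every compression cURL supports and decode it transparently
    curl_setopt($ch, CURLOPT_ENCODING, '');
    // Return the response body as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $content = curl_exec($ch); // false on failure
    curl_close($ch);
    return $content;
}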

Implementation:

$url = 'http://';
//$html = file_get_contents($url);
$html = getData($url);

if (function_exists('tidy_parse_string')) {
    $tidy = tidy_parse_string($html, array(), 'UTF8');
    $tidy->cleanRepair();
    $html = $tidy->value;
}

$readability = new Readability($html, $url);

//...

Question 4

There is no such built-in function in PHP. I am afraid you will have to parse and analyse the HTML document yourself. You will probably need to use some XML parser; the SimpleXML library is a good candidate for well-formed markup, while DOMDocument::loadHTML() copes better with real-world HTML.

I am not familiar with the "Reader mode" feature you are referring to, but a good starting point would probably be removing all <img> elements. The actual "cleaning" algorithm it uses is certainly not trivial, and it seems to be implemented in JavaScript as a call to a third-party, closed-source service.
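To make "parse and analyse the HTML yourself" a bit more concrete, here is a rough sketch. It uses DOMDocument and DOMXPath rather than SimpleXML, since real pages are rarely well-formed XML. The function name extractMainText, the list of removed tags, and the scoring rule (pick the element whose direct <p> children hold the most text) are all my own assumptions. It is a crude heuristic, not the algorithm that Reader Mode or Readability actually uses.

```php
<?php
// Hypothetical helper: return the text of the element whose direct <p>
// children contain the most text. A crude stand-in for real extraction.
function extractMainText(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    libxml_use_internal_errors(true);
    // The XML declaration prefix hints to libxml that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Remove elements that normally carry no article text.
    $junk = $xpath->query('//script|//style|//img|//nav|//footer');
    foreach (iterator_to_array($junk) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Score every element that has at least one direct <p> child.
    $best = null;
    $bestScore = 0;
    foreach ($xpath->query('//*[p]') as $candidate) {
        $score = 0;
        foreach ($candidate->childNodes as $child) {
            if ($child->nodeName === 'p') {
                $score += strlen(trim($child->textContent));
            }
        }
        if ($score > $bestScore) {
            $best = $candidate;
            $bestScore = $score;
        }
    }
    if ($best === null) {
        return '';
    }

    // Join the winning element's paragraphs with blank lines.
    $paragraphs = [];
    foreach ($best->childNodes as $child) {
        if ($child->nodeName === 'p') {
            $paragraphs[] = trim($child->textContent);
        }
    }
    return implode("\n\n", $paragraphs);
}

// Made-up sample page: navigation, an article and a sidebar.
$sample = <<<HTML
<html><body>
  <nav><p>Home | About | Contact</p></nav>
  <div id="article">
    <p>First paragraph of the article body.</p>
    <img src="photo.jpg" alt="">
    <p>Second paragraph with more text.</p>
  </div>
  <div id="sidebar"><p>Ad</p></div>
</body></html>
HTML;

echo extractMainText($sample), "\n";
```

On real sites you would want a richer score (link density, class names, nesting), which is roughly what the Readability ports do.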

Question 5

That displays the whole content. If you want more information, search Google for regular expressions and how to get the value between tags in an HTML file. I will show you why with a demo :)

First off, when you use the file_get_contents() function you get the file as HTML code, but when you output it the browser renders it as a page. Look at this code:

$html = file_get_contents('http://coder-dz.com');
preg_match_all('/<li>(.*?)<\/li>/s', $html, $matches);
foreach ($matches[1] as $mytitle) {
    echo $mytitle . "<br/>";
}

So what did I do here? I fetched the content of my website, which runs WordPress. The titles are inside HTML li tags, so I used a regular expression to get the values between those tags.
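One caveat with the regular expression above: it only matches a bare <li>, so it skips items written as <li class="..."> and it returns any nested tags as-is. Here is a sketch of the same extraction with DOMDocument and DOMXPath, assuming the DOM extension is enabled. The function name listItemTexts and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: return the visible text of every <li> in the HTML.
function listItemTexts(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    libxml_use_internal_errors(true);
    // The XML declaration prefix hints to libxml that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $texts = [];
    foreach ((new DOMXPath($doc))->query('//li') as $li) {
        // textContent drops nested tags; collapse runs of whitespace.
        $texts[] = trim(preg_replace('/\s+/', ' ', $li->textContent));
    }
    return $texts;
}

// Made-up sample: one item has attributes and a nested link.
$sample = '<ul><li class="menu-item"><a href="/a">First post</a></li>'
        . '<li>Second   post</li></ul>';

foreach (listItemTexts($sample) as $title) {
    echo $title, "<br/>\n";
}
```

To run it against a live page, pass the string returned by file_get_contents() instead of the sample.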

I hope you get my point, as my English is not great. If you have any questions, feel free to ask me.