How to 'scrape' content from a page's source? [closed]

https://stackoverflow.com/questions/7321474

PHP
scrape

27-10-2019

Question

I have this code which gets the HTML source of a page:

$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);

I want to scrape some content from it. For example, say the page's source contains this:

<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br /><br />Pinging <strong>newsgator.com</strong><br />
Done<br /><br />Pinging <strong>blo.gs</strong><br />
Done<br /><br />Pinging <strong>feedburner.com</strong><br />
Done<br /><br />Pinging <strong>blogstreet.com</strong><br />
Done<br /><br />Pinging <strong>my.yahoo.com</strong><br />
Connection failed<br /><br />Pinging <strong>moreover.com</strong><br />
Connection failed<br /><br />Pinging <strong>newsisfree.com</strong><br />
Done<br />

Is there a way I could scrape this from the source and store it in a variable, so it'll look like this:

technorati.com Connection failed
icerocket.com Connection failed
weblogs.com Done
Etc.

Of course, the page is dynamic, which is why I'm having a problem. Could I maybe search for each site in the source? But then how would I get the result that comes after it (Connection failed / Done)?
Thanks a lot for the help!

Solution

I have scraped several sites using the Simple HTML DOM library for PHP, which can be obtained here: http://simplehtmldom.sourceforge.net/

Then using code like this:

<?php
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);

//remove additional spaces
$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

foreach ($html->find('h2') as $heading) { // for each heading
    // take the first link inside a span, if there is one, and echo its text
    $link = $heading->find('span a', 0);
    if ($link) {
        echo preg_replace($pat, $rep, $link->plaintext) . "\n";
    }
}
?>

This results in something like:

5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents

OTHER TIPS

This isn't the best solution, but it works:

$page = file_get_contents('http://example.com/page.html');
preg_match_all('#<strong>([^<]+)</strong><br />\s*([^<]+)<#', $page, 
                                             $result, PREG_SET_ORDER);
foreach ($result as $row) {
    echo "<p><b>$row[1]</b> $row[2]</p>\n";
}

If you need to scrape something more complex, consider DOMDocument.
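As a rough illustration of the DOMDocument route, here is a minimal sketch applied to the markup from the question. The function name scrapePingResults is made up for this example, and the sample string stands in for whatever file_get_contents() returns. It assumes every site name sits in a <strong> tag and that the status is the first non-empty text after it.

```php
<?php
// Sample markup copied from the question; in practice this would be
// the raw result of file_get_contents(), not passed through htmlentities().
$page = <<<HTML
<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br />
HTML;

/**
 * Hypothetical helper: maps each <strong> site name to the status text
 * that follows it. A site listed twice keeps only its last status.
 */
function scrapePingResults(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $results = [];

    foreach ($xpath->query('//strong') as $strong) {
        // First non-blank text node that is a later sibling of this <strong>.
        $status = $xpath
            ->query('following-sibling::text()[normalize-space()][1]', $strong)
            ->item(0);

        $site = trim($strong->textContent);
        $results[$site] = $status ? trim($status->nodeValue) : '';
    }

    return $results;
}

$results = scrapePingResults($page);

foreach ($results as $site => $status) {
    echo "$site $status\n";
}
```

Because the lookup walks the parsed tree rather than the raw text, it should tolerate small differences in whitespace or in how the <br> tags are written, which a regex tied to the exact markup would not.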

You can use Regular Expressions.

Edit

Regex isn't the best solution for large problems, but for simple pages with a standard format it is often the simplest option.
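For a page as regular as the one in the question, a regex sketch along these lines may be enough to collect the results into a single variable, as the asker wanted. It reuses the pattern from the earlier answer; the variable names and the sample string are assumptions for this example.

```php
<?php
// Sample markup copied from the question; normally the raw output of
// file_get_contents().
$page = <<<HTML
<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br />
HTML;

// Group 1: site name inside <strong>. Group 2: status text after the <br />.
$pattern = '#<strong>([^<]+)</strong><br />\s*([^<]+)<#';
preg_match_all($pattern, $page, $matches, PREG_SET_ORDER);

// Build one "site status" line per match.
$lines = [];
foreach ($matches as $match) {
    $lines[] = trim($match[1]) . ' ' . trim($match[2]);
}

// Store the whole report in a single variable.
$summary = implode("\n", $lines);
echo $summary, "\n";
```

Note that the pattern has to run on the raw source. If the page is passed through htmlentities() first, as in the question's code, the tags become &lt;strong&gt; and nothing will match.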

Licensed under: CC-BY-SA with attribution
