Question

I am working on a small project to extract information from several web pages based on their HTML markup, and I do not know where to start.

The basic idea is to get the title from the <h1></h1> tags, the content from the <p></p> tags, and any other important information that is required.

I would have to set up each case for each source for it to work the way it needs to. I believe the right method is using the $_GET method with PHP. The goal of the project is to build a database of information.

What is the best method to grab the information I need?

Solution

First of all: PHP's $_GET is not a method. As you can see in the documentation, $_GET is simply an array populated with the URL query-string parameters of the request your script is currently handling. It describes incoming requests, not outgoing ones, so it is not what you want to use for this kind of thing.
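
To make the point concrete, here is a small sketch of what "just an array" means. It uses parse_str() to mimic how a query string ends up as an associative array; the query string `id=42&lang=en` and the variable names are made up for illustration.

```php
<?php
// Hypothetical query string, as in a request to /page.php?id=42&lang=en
$queryString = 'id=42&lang=en';

// parse_str() fills $params with the same kind of associative array
// that PHP exposes as $_GET for an incoming request.
parse_str($queryString, $params);

var_dump(is_array($params)); // bool(true)
echo $params['id'], "\n";    // 42
```

Nothing here contacts another server: the array only holds what was sent to your own script.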

What you should look into is cURL, which allows you to compose even fairly complex requests, send them to the destination server, and retrieve the response. For example, for a POST request you could do something like:

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "http://www.mysite.com/tester.phtml");
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS,
            "postvar1=value1&postvar2=value2&postvar3=value3");

// in real life you should build the body with http_build_query(),
// which URL-encodes the values for you:
// curl_setopt($ch, CURLOPT_POSTFIELDS,
//          http_build_query(array('postvar1' => 'value1')));

// return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// $server_output is false if the request failed
$server_output = curl_exec($ch);

curl_close($ch);
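
Since scraping mostly needs plain GET requests, the same pattern could be wrapped up roughly as follows. This is only a sketch: the function name `curlGet`, the timeout default, and the choice to treat HTTP status codes of 400 and above as failures are assumptions, not part of the original answer.

```php
<?php
/**
 * Fetch a URL with a GET request via cURL.
 * Returns the response body, or null if the request failed.
 * (curlGet is a hypothetical helper name.)
 */
function curlGet(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // limit for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // limit for the whole request
    ]);

    $body   = curl_exec($ch);                      // false on transport errors
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Assumption: any 4xx/5xx answer counts as a failed fetch.
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

A call such as `curlGet('http://www.example.com/')` would then give you either the page source or null, which keeps the error handling in one place.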

Of course, if you don't need any complex requests but only simple GET requests, you can use the PHP function file_get_contents.
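
A minimal sketch of that simpler route, assuming you want a timeout and a clean failure value. The helper name `fetchPage` and the User-Agent string are invented for the example; the stream context options are the ones PHP's HTTP wrapper accepts.

```php
<?php
/**
 * Fetch a URL with a simple GET request.
 * Returns the body, or null when the request fails.
 * (fetchPage is a hypothetical helper name.)
 */
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    // Options for the HTTP stream wrapper.
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'timeout' => $timeoutSeconds,
            'header'  => "User-Agent: MyScraper/1.0\r\n", // made-up UA string
        ],
    ]);

    // file_get_contents() returns false and raises a warning on failure;
    // the warning is suppressed so the caller only has to check for null.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}
```

Calling `fetchPage('http://www.example.com/')` should return the page source as a string. Note that fetching remote URLs this way depends on the allow_url_fopen setting being enabled.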

After you have received the web page content, you have to parse it. IMHO the best way to do this is with PHP's DOM functions. How to use them should really be another question, but you can find plenty of examples without much effort.
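
As a starting point, parsing with the DOM extension might look like the sketch below. The helper `extractText` and the sample HTML are hypothetical; the XPath expressions `//h1` and `//p` match the tags mentioned in the question, and each source could get its own set of expressions.

```php
<?php
/**
 * Return the trimmed text of every node matched by an XPath expression.
 * (extractText is a hypothetical helper name.)
 *
 * @return string[]
 */
function extractText(string $html, string $xpathQuery): array
{
    // Real-world HTML is rarely valid: collect parse warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathQuery);
    if ($nodes === false) {
        return []; // malformed XPath expression
    }

    $results = [];
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// Sample markup standing in for a downloaded page.
$html = '<html><body><h1>Hello</h1><p>First.</p><p>Second.</p></body></html>';

$titles     = extractText($html, '//h1');
$paragraphs = extractText($html, '//p');

print_r($titles);
print_r($paragraphs);
```

Keeping the XPath expression as a parameter means a new source only needs a new expression, not new parsing code.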

OTHER TIPS

<?php
$remote = file_get_contents('http://www.remote_website.html');

// Collect parse warnings from imperfect HTML instead of silencing them with @
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($remote);

// Text of every <h1> element
$titles = array();
foreach ($doc->getElementsByTagName('h1') as $cell) {
    $titles[] = $cell->nodeValue;
}

// Text of every <p> element
$content = array();
foreach ($doc->getElementsByTagName('p') as $cell) {
    $content[] = $cell->nodeValue;
}
?>

You can get the HTML source of a page with:

<?php
$html = file_get_contents('http://www.example.com/');
echo $html;
?>

Then, once you have the structure of the page, you can extract the tag you need with substr() and strpos().
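
A rough sketch of that string-based approach is shown here. The helper `textBetween` and the sample markup are made up for illustration, and the code assumes the opening tag appears exactly as written (no attributes, same letter case).

```php
<?php
/**
 * Return the text between the first occurrence of $open and the next
 * $close after it, or null when either marker is missing.
 * (textBetween is a hypothetical helper name.)
 */
function textBetween(string $html, string $open, string $close): ?string
{
    $start = strpos($html, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open); // skip past the opening marker

    $end = strpos($html, $close, $start);
    if ($end === false) {
        return null;
    }

    return substr($html, $start, $end - $start);
}

// Sample markup standing in for a downloaded page.
$html = '<html><body><h1>Page title</h1><p>Some text.</p></body></html>';

echo textBetween($html, '<h1>', '</h1>'), "\n"; // Page title
```

This only finds the first match, so pages with several matching tags need a loop that passes a moving offset to strpos().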

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow