Question

Issue:
Cannot fully understand the Goutte web scraper.

Request:
Can someone please help me understand, or provide code that demonstrates, how to use the Goutte web scraper? I have read the README.md, but I am looking for more information than it provides, such as: what options are available in Goutte and how to write them, and, when working with forms, whether you search for the name= or the id= of the form?

Webpage Layout attempting to be scraped:
Step 1:
The webpage has a form with a radio button to choose which kind of form to fill out (i.e. Name or License). It defaults to Name, with First Name and Last Name textboxes along with a State drop-down select list. If you choose License, jQuery/JavaScript hides the First and Last Name textboxes and a License textbox appears.

Step 2:
Once you have successfully submitted the form, it brings you to a page with multiple links. Either of two of them leads to the information we need.

Step 3:
Once we have clicked the link we want, the third page has the data we are looking for, and we want to store that data in a PHP variable.

Submitting Incorrect information:
If wrong information is submitted, jQuery/JavaScript displays the message "No records were found." on the same page as the submission.

Note:
The preferred method would be to select the License radio button, fill in the license number, choose the state, and then submit the form. I have read tons of posts, blogs, and other material about Goutte, and nowhere can I find what options are available for Goutte, how to find out this information, or how to use it if it does exist.


Solution

The documentation you want to look at is the Symfony2 DomCrawler.

Goutte is a client built on top of Guzzle that returns a Crawler every time you request or submit something:

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'http://www.symfony-project.org/');

With this crawler you can do stuff like get all the P tags inside the body:

use Symfony\Component\DomCrawler\Crawler; // needed for the type hint below

$nodeValues = $crawler->filter('body > p')->each(function (Crawler $node, $i) {
    return $node->text();
});
print_r($nodeValues);

Fill and submit forms:

$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array(
    'username' => 'username',
    'password' => 'xxxxxx',
));

A selectButton() method is available on the Crawler; it returns another Crawler that matches a button (input[type=submit], input[type=image], or a button element) with the given text.

You can also click links, set options, tick checkboxes, and more; see the DomCrawler documentation on Form and Link support.
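
Tying that back to the question: DomCrawler addresses form fields by their name= attribute, not their id=, and because Goutte does not execute JavaScript, the show/hide behaviour of the radio buttons does not matter; the license field can be filled as long as it is present in the HTML. Below is a minimal sketch of the whole flow; the URL, button text, link text, field names, and selector are all placeholders you would replace with the real values from the page:

use Goutte\Client;

$client = new Client();

// Step 1: load the search page and grab the form by its submit button text
$crawler = $client->request('GET', 'http://www.example.com/search'); // placeholder URL
$form = $crawler->selectButton('Search')->form(); // placeholder button text

$form['search_type']->select('license');    // hypothetical radio group; pick by value
$form['license_number']->setValue('12345'); // hypothetical text field
$form['state']->select('CA');               // hypothetical select list; pick an option value

$crawler = $client->submit($form);

// Step 2: follow one of the result links by its visible text
$link = $crawler->selectLink('View details')->link(); // placeholder link text
$crawler = $client->click($link);

// Step 3: store the data we want in a PHP variable
$data = $crawler->filter('#results')->text(); // hypothetical selector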

To get data from the crawler, use the html or text methods:

echo $crawler->html();
echo $crawler->text();
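
One caveat for the "No records were found." case in the question: Goutte does not execute JavaScript, so a message injected client-side by jQuery may never show up in the fetched HTML. If the server renders the message into the page, you can test for it before following any links (a sketch, assuming the exact wording from the question):

if (strpos($crawler->text(), 'No records were found.') !== false) {
    // wrong information was submitted; retry instead of following links
}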

OTHER TIPS

After much trial and error I discovered a scraper that is much easier, better documented, offers better assistance (if needed), and is far more effective than Goutte. If you are having issues with Goutte, try the following:

  1. Simple HTML DOM: http://simplehtmldom.sourceforge.net/

If you are in the same situation I was in, where the page you are trying to scrape requires a Referer header from the site's own domain, you can use a combination of cURL and Simple HTML DOM, because Simple HTML DOM does not appear to be able to send a referrer. If you do not need a referrer, you can use Simple HTML DOM on its own to scrape the page.

$url="http://www.example.com/sub-page-needs-referer/";
$referer="http://www.example.com/";
$html=new simple_html_dom(); // Create a new object for SIMPLE HTML DOM
/** cURL Initialization  **/
$ch = curl_init($url);

/** Set the cURL options **/
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_REFERER,$referer);
$output = curl_exec($ch);

if($output === FALSE) {
  echo "cURL Error: ".curl_error($ch); // do something here if we couldn't scrape the page
}
else {
  $info = curl_getinfo($ch);
  echo "Took ".$info['total_time']." seconds for url: ".$info['url'];
  $html->load($output); // Transfer CURL to SIMPLE HTML DOM
}

/** Free up cURL **/
curl_close($ch);

// Do something with SIMPLE HTML DOM.  It is well documented and very easy to use.  They have a lot of examples.
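
Once the page is loaded, you can query it with CSS-like selectors via find(). A small sketch; the selectors here are illustrative, not taken from a real page:

// Illustrative selectors -- inspect the target page for the real ones
foreach ($html->find('td') as $cell) {
    echo $cell->plaintext . "\n"; // text content of each table cell
}

$node = $html->find('#license-status', 0); // hypothetical element id
if ($node) {
    $data = $node->plaintext; // store the scraped value in a PHP variable
}

$html->clear(); // free memory when finished, as the library recommends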