Question

I'm looking for a library that has functionality similar to Perl's WWW::Mechanize, but for PHP. Basically, it should allow me to submit HTTP GET and POST requests with a simple syntax, and then parse the resulting page and return in a simple format all forms and their fields, along with all links on the page.

I know about CURL, but it's a little too barebones, and the syntax is pretty ugly (tons of curl_foo($curl_handle, ...) statements).

Clarification:

I want something more high-level than the answers so far. For example, in Perl, you could do something like:

# navigate to the main page
$mech->get( 'http://www.somesite.com/' ); 

# follow a link that contains the text 'download this'
$mech->follow_link( text_regex => qr/download this/i );

# submit a POST form, to log into the site
$mech->submit_form(
    with_fields      => {
        username    => 'mungo',
        password    => 'lost-and-alone',
    }
);

# save the results as a file
$mech->save_content('somefile.zip');

To do the same thing using HTTP_Client or wget or CURL would be a lot of work, I'd have to manually parse the pages to find the links, find the form URL, extract all the hidden fields, and so on. The reason I'm asking for a PHP solution is that I have no experience with Perl, and I could probably build what I need with a lot of work, but it would be much quicker if I could do the above in PHP.

Solution

SimpleTest's ScriptableBrowser can be used independently from the testing framework. I've used it for numerous automation jobs.
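
A minimal sketch of what that looks like (assuming SimpleTest is installed; the URL, field names, and button label below are placeholders):

require_once 'simpletest/browser.php';

$browser = new SimpleBrowser();
$browser->get('http://www.somesite.com/');

// Follow a link by its visible text.
$browser->clickLink('download this');

// Fill in and submit a login form.
$browser->setFieldByName('username', 'mungo');
$browser->setFieldByName('password', 'lost-and-alone');
$browser->clickSubmit('Log in');

$content = $browser->getContent();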

OTHER TIPS

I feel compelled to answer this, even though it's an old post... I've been working with PHP's curl a lot, and it is nowhere near comparable to something like WWW::Mechanize, which I am switching to (I think I'm going to go with the Ruby implementation). Curl requires too much grunt work to automate anything. SimpleTest's scriptable browser looked promising to me, but in testing it wouldn't work on most of the web forms I tried it on. Honestly, I think PHP is lacking in this category of scraping and web automation, so it's best to look at a different language. I just wanted to post this since I have spent countless hours on this topic, and maybe it will save someone else some time in the future.

It's 2016 now and there's Mink. It even supports different engines, from a headless pure-PHP "browser" (without JavaScript), through Selenium (which needs a real browser like Firefox or Chrome), to a headless "browser.js" in NPM, which does support JavaScript.
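
A short sketch with the pure-PHP Goutte driver (assuming Mink and the Goutte driver are installed via Composer; the URL, field names, and button label are placeholders):

require 'vendor/autoload.php';

use Behat\Mink\Mink;
use Behat\Mink\Session;
use Behat\Mink\Driver\GoutteDriver;

$mink = new Mink(array('goutte' => new Session(new GoutteDriver())));
$session = $mink->getSession('goutte');

$session->visit('http://www.somesite.com/');
$page = $session->getPage();

// Follow a link by its text, then log in.
$page->clickLink('download this');
$page->fillField('username', 'mungo');
$page->fillField('password', 'lost-and-alone');
$page->pressButton('Log in');

echo $session->getPage()->getContent();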

Try looking in the PEAR library. If all else fails, create an object wrapper for curl.

You can do something simple like this:

class curl {
    private $resource;

    public function __construct($url) {
        $this->resource = curl_init($url);
    }

    // Forwards e.g. $obj->setopt(...) to curl_setopt($resource, ...),
    // so every curl_* function becomes a method on this object.
    public function __call($function, array $params) {
        array_unshift($params, $this->resource);
        return call_user_func_array("curl_$function", $params);
    }
}
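
Hypothetical usage of that wrapper (the method names simply map onto the underlying curl_* functions):

$request = new curl('http://www.example.com/');
$request->setopt(CURLOPT_RETURNTRANSFER, true);
$html = $request->exec();
$request->close();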

Try one of the following:

(Yes, it's Zend Framework code, but using it doesn't make your application slower, since it only loads the required libraries.)
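
Presumably this refers to components like Zend_Http_Client. A minimal sketch (assuming Zend Framework 1 is on the include path; the URL and credentials are placeholders):

require_once 'Zend/Http/Client.php';

$client = new Zend_Http_Client('http://www.somesite.com/login');
$client->setParameterPost(array(
    'username' => 'mungo',
    'password' => 'lost-and-alone',
));
$response = $client->request(Zend_Http_Client::POST);
echo $response->getBody();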

Curl is the way to go for simple requests. It runs cross-platform, has a PHP extension, and is widely adopted and tested.

I created a nice class that can GET and POST an array of data (INCLUDING FILES!) to a URL by just calling CURLHandler::Get($url) or CURLHandler::Post($url, $data). There's an optional HTTP user authentication option too :)

/**
 * CURLHandler handles simple HTTP GETs and POSTs via Curl 
 * 
 * @package Pork
 * @author SchizoDuckie
 * @copyright SchizoDuckie 2008
 * @version 1.0
 * @access public
 */
class CURLHandler
{

    /**
     * CURLHandler::Get()
     * 
     * Executes a standard GET request via Curl.
     * Static function, so that you can use: CurlHandler::Get('http://www.google.com');
     * 
     * @param string $url url to get
     * @return string HTML output
     */
    public static function Get($url)
    {
       return self::doRequest('GET', $url);
    }

    /**
     * CURLHandler::Post()
     * 
     * Executes a standard POST request via Curl.
     * Static function, so you can use CurlHandler::Post('http://www.google.com', array('q'=>'StackOverFlow'));
     * If you want to send a File via post (to e.g. PHP's $_FILES), prefix the value of an item with an @ ! 
     * @param string $url url to post data to
     * @param array $vars Array with key=>value pairs to post.
     * @param array|bool $auth Optional array('username' => ..., 'password' => ...) for HTTP authentication.
     * @return string HTML output
     */
    public static function Post($url, $vars, $auth = false) 
    {
       return self::doRequest('POST', $url, $vars, $auth);
    }

    /**
     * CURLHandler::doRequest()
     * This is what actually does the request
     * <pre>
     * - Create Curl handle with curl_init
     * - Set options like CURLOPT_URL, CURLOPT_RETURNTRANSFER and CURLOPT_HEADER
     * - Set eventual optional options (like CURLOPT_POST and CURLOPT_POSTFIELDS)
     * - Call curl_exec on the interface
     * - Close the connection
     * - Return the result or throw an exception.
     * </pre>
     * @param string $method Request method ('GET' or 'POST')
     * @param string $url URI to get or post to
     * @param array $vars Array of variables (only used in POST requests)
     * @param array|bool $auth Optional array('username' => ..., 'password' => ...) for HTTP authentication.
     * @return string HTML output
     */
    public static function doRequest($method, $url, $vars=array(), $auth = false)
    {
        $curlInterface = curl_init();

        curl_setopt_array($curlInterface, array(
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_FOLLOWLOCATION => 1,
            CURLOPT_HEADER => 0
        ));
        if (strtoupper($method) == 'POST')
        {
            curl_setopt_array($curlInterface, array(
                CURLOPT_POST => 1,
                // Pass the raw array so '@'-prefixed file uploads keep working;
                // http_build_query() would stringify them and break the upload.
                CURLOPT_POSTFIELDS => $vars
            ));
        }
        if($auth !== false)
        {
              curl_setopt($curlInterface, CURLOPT_USERPWD, $auth['username'] . ":" . $auth['password']);
        }
        $result = curl_exec($curlInterface);

        if ($result === false)
        {
            // Read the error before closing the handle; curl_errno()/curl_error()
            // return nothing useful once the handle has been closed.
            $error = curl_errno($curlInterface) . ' - ' . curl_error($curlInterface);
            curl_close($curlInterface);
            throw new Exception('Curl Request Error: ' . $error);
        }
        curl_close($curlInterface);
        return $result;
    }

}

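Usage of the class above then looks like this (the URL and credentials are placeholders):

// Plain GET.
$html = CURLHandler::Get('http://www.example.com/');

// POST with optional HTTP authentication.
$html = CURLHandler::Post('http://www.example.com/login',
    array('q' => 'StackOverflow'),
    array('username' => 'mungo', 'password' => 'lost-and-alone'));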

[edit] I only read the clarification now... You probably want to go with one of the tools mentioned above that automate things. You could also use a client-side Firefox extension like Chickenfoot for more flexibility. I'll leave the example class above here for future searches.

If you're using CakePHP in your project, or if you're inclined to extract the relevant library, you can use its HTTP client class HttpSocket. It has the simple page-fetching syntax you describe, e.g.,

# This is the sugar for importing the library within CakePHP
App::import('Core', 'HttpSocket');
$HttpSocket = new HttpSocket();

$result = $HttpSocket->post($login_url, array(
    'username' => 'username',
    'password' => 'password'
));

...although it doesn't have a way to parse the response page. For that I'm going to use simplehtmldom: http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/, which describes itself as having a jQuery-like syntax.
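
For instance, pulling all the links out of the fetched page with simplehtmldom looks roughly like this (assuming simple_html_dom.php is on the include path and $result holds the response body from the HttpSocket call above):

require_once 'simple_html_dom.php';

$html = str_get_html($result);
foreach ($html->find('a') as $link) {
    echo $link->href . "\n";
}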

I tend to agree that the bottom line is that PHP doesn't have the awesome scraping/automation libraries that Perl/Ruby have.

If you're on a *nix system you could use shell_exec() with wget, which has a lot of nice options.
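
A minimal sketch (the URL is a placeholder; escapeshellarg() guards anything user-supplied before it reaches the shell):

// Fetch a page with wget via the shell; -q suppresses progress output,
// -O - writes the document to stdout so shell_exec() captures it.
$url = 'http://www.example.com/somefile.zip';
$output = shell_exec('wget -q -O - ' . escapeshellarg($url));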

Licensed under: CC-BY-SA with attribution