I've been attempting to create a spider that will run via cron every morning. I wanted to incorporate a headless browser in order to get the rendered DOM (after JavaScript runs).

I tried using Crowbar (a headless browser) and managed to fetch one (and only one) page through it via cURL. Its documentation is non-existent, and it hangs after the first request.

How can I kill Crowbar's process via PHP, so that I can start and stop it at will? Or do people just leave these headless browsers running constantly? That seems like a resource drain.

This is the code I've tried, but killing the process does not work.

$toExecute = "\"" .ROOT . "/vendors/xulrunner/xulrunner.exe \" \"". ROOT . "/app/Vendor/crowbar/xulapp/application.ini \" 2>&1 &";
$handle = shell_exec($toExecute);

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://127.0.0.1:10000/?url=' . $url . '&delay=3000&view=as-is');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($curl);

exec("kill -KILL ".$handle); //this does not work...
echo $html;

Or is there a better way to scrape in php with javascript? I'd love to hear it...


Solution

shell_exec() does not return a handle; it returns the output of the command. So $handle contains whatever was written to STDOUT (and STDERR, since you redirected it with 2>&1). Instead, silence the command's output and append echo $!, which expands to the PID of the last background command. shell_exec() then returns that PID, which you can pass to kill to properly terminate xulrunner.exe.

So, to sum it up:

$toExecute = "<path>/xulrunner.exe <params> >/dev/null 2>/dev/null & echo $!";
$myPid = shell_exec( $toExecute );

...

exec( "/bin/kill $myPid" );
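To make the pattern concrete, here is a minimal, self-contained sketch. It uses sleep 30 as a stand-in for the real xulrunner.exe invocation (the actual path and arguments from the question would go there), so it can run anywhere on a Unix-like system:

```php
<?php
// Launch a long-running command in the background, silencing its output,
// and echo $! so shell_exec() returns the PID rather than the command's output.
// "sleep 30" is a placeholder for the real xulrunner.exe command line.
$pid = (int) trim(shell_exec('sleep 30 >/dev/null 2>&1 & echo $!'));

// ... talk to the headless browser here, e.g. via cURL against 127.0.0.1:10000 ...

// Kill the background process once the rendered HTML has been fetched.
// $status is 0 when the process existed and the signal was delivered.
exec('kill ' . $pid, $output, $status);
```

Note that trim() and the (int) cast are needed because shell_exec() returns the PID with a trailing newline.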

Note that for safety you should apply escapeshellarg() and escapeshellcmd() where appropriate; otherwise you are exposing yourself to shell-injection shenanigans.
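For instance, a sketch of escaping both the command-line pieces and the target URL (the paths and example URL here are illustrative, not taken from a real install):

```php
<?php
// Escape each path before interpolating it into the shell command line,
// so spaces or quotes in the paths cannot break (or hijack) the command.
$exe = '/vendors/xulrunner/xulrunner.exe';
$ini = '/app/Vendor/crowbar/xulapp/application.ini';
$cmd = escapeshellarg($exe) . ' ' . escapeshellarg($ini)
     . ' >/dev/null 2>&1 & echo $!';

// urlencode() the target URL before placing it in Crowbar's query string,
// so its own "?" and "&" characters do not split the query parameters.
$url        = 'http://example.com/page?a=1&b=2';
$crowbarUrl = 'http://127.0.0.1:10000/?url=' . urlencode($url)
            . '&delay=3000&view=as-is';
```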

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow