phantomjs page.evaluate to scrape "resultStats" from http://www.google.com/search?q=site:%s works in local server but not production server

StackOverflow https://stackoverflow.com/questions/21490452

  •  05-10-2022

Question

Using phantomjs page.evaluate to extract "resultStats" (a div id) from http://www.google.com/search/?q=site:%s works on my local server but not on my production server.

NOTE: I'm using the latest phantomjs, 1.9.7, but I experienced the same issue with the previous version, 1.9.6.

NOTE: Phantomjs page.render (on Google home page as well as any other domain name) is working on both servers and creates nice screenshots.

On my production server (Debian stable 7.3 @linode.com), the PHP code below, given a top-level domain name as "$url", returns:

TypeError: 'null' is not an object (evaluating 'document.getElementById('resultStats').textContent') phantomjs://webpage.evaluate():2 phantomjs://webpage.evaluate():3 phantomjs://webpage.evaluate():3 null

On my local server (Debian testing), the PHP code below returns the following for the same "$url":

About 43 results

This happens with any domain name/url I use as the argument - I've tested it on dozens.

What might cause this to occur in my remote production server and not my local server?

gsiteindex.js

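// Usage: phantomjs gsiteindex.js <site>
var page = require('webpage').create();
var site = phantom.args[0]; // first command-line argument (PhantomJS 1.x)
page.open("https://www.google.com/search?q=site:" + site, function (status) {
  var result = page.evaluate(function () {
    // Throws the TypeError above when the page has no #resultStats element
    return document.getElementById('resultStats').textContent;
  });
  console.info(result);
  phantom.exit();
});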

.php

$phantomjs = "phantomjs";
$script = "gsiteindex.js";
$site = $url;
// Escape the argument so the URL cannot inject shell syntax
$command = "$phantomjs $script " . escapeshellarg($site);
$googlestring = shell_exec($command);
echo $googlestring;
die();

Solution

Well, nrabinowitz was right. I tested it further on my own server using proxies: most timed out, some returned the error above, and a couple returned correct results. (I assume they were correct given the location of each proxy's IP address, since the figures differed slightly from those I get through my ISP's public IP address in California, USA.)

So it's simply a matter of Google blocking certain types of requests from certain IP addresses.

Thanks again for the comment.
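
To tell a blocked request apart from a successful one, the script can check for the element before reading it and report where the page ended up. The following is only a sketch under assumptions: `diagnose` is a hypothetical helper, not part of PhantomJS, and the final URL is read with `location.href` inside `page.evaluate`. It does not work around the blocking; it only makes the failure visible instead of throwing a TypeError.

```javascript
// Hypothetical helper: summarise the outcome so that a blocked request
// is distinguishable from a successful one.
function diagnose(loadStatus, finalUrl, statsText) {
  if (loadStatus !== 'success') {
    return 'load failed';
  }
  if (statsText === null) {
    return 'no #resultStats; final URL was ' + finalUrl;
  }
  return 'ok: ' + statsText;
}

// The PhantomJS-specific part only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var site = require('system').args[1]; // args[0] is the script name
  var url = 'https://www.google.com/search?q=site:' + encodeURIComponent(site);

  page.open(url, function (status) {
    var info = page.evaluate(function () {
      // Return null instead of throwing when the element is absent
      var el = document.getElementById('resultStats');
      return { url: location.href, stats: el ? el.textContent : null };
    }) || { url: '', stats: null };
    console.log(diagnose(status, info.url, info.stats));
    phantom.exit(status === 'success' && info.stats !== null ? 0 : 1);
  });
}
```

Run on both servers with the same argument, the printed final URL should show whether the production request was sent somewhere other than the normal results page.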

Other tips

Include a header with a user agent, e.g.:

header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}

Without a user agent you get Google's default-style page, which has no resultStats. I also had this issue, and adding the header helped.

The default Google search page looks like this: [image]
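
The header dictionary above is not PhantomJS syntax. In a PhantomJS script, the usual way to send a custom user agent is the `page.settings.userAgent` setting, assigned before `page.open` is called. The sketch below assumes that approach; `withUserAgent` and `FIREFOX_UA` are hypothetical names introduced for illustration, and whether Google then serves a page containing resultStats still depends on how it treats the requesting IP address.

```javascript
// User-Agent string copied from the answer above.
var FIREFOX_UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0';

// Hypothetical helper: set the User-Agent on a PhantomJS page object.
function withUserAgent(page, userAgent) {
  if (!page.settings) {
    page.settings = {};
  }
  page.settings.userAgent = userAgent;
  return page;
}

// The PhantomJS-specific part only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = withUserAgent(require('webpage').create(), FIREFOX_UA);
  var site = require('system').args[1]; // args[0] is the script name
  var url = 'https://www.google.com/search?q=site:' + encodeURIComponent(site);

  page.open(url, function (status) {
    var stats = page.evaluate(function () {
      var el = document.getElementById('resultStats');
      return el ? el.textContent : null;
    });
    console.log(stats === null ? 'resultStats not found' : stats);
    phantom.exit();
  });
}
```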

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow