Question

I thought this would be fairly simple, but it's proving challenging. Google uses https:// now, and Bing redirects to strip the http://.

How can I grab the top 5 URLs for a given search term?

I've tried several methods (including loading results into an iframe), but keep hitting brick walls with everything I try.

I wouldn't even need a proxy, as I'm talking about a very small number of results to harvest, and I'll only use it for 20-30 terms once every few months. Hardly enough to trigger backlash from the search giants.

Any help would be much appreciated!

Here's one example of what I've tried:

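$query = urlencode("test");

// $query is already URL-encoded above, so it is not encoded a second time here.
$html = file_get_contents("http://www.bing.com/search?q=" . $query);

// Capture the href value of every anchor that carries a title attribute.
preg_match_all('/<a title=".*?" href="(.*?)"/', $html, $matches);

echo implode("<br>", $matches[1]);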

Solution

There are three main ways to do this. First, use the official API for the search engine you're targeting: Google has one, and most other engines do too. These are often volume-limited, but for the numbers you're talking about, you'll be fine.
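As a rough sketch of the API route, assuming the Google API in question is the Custom Search JSON API (the answer doesn't name it): you request a URL carrying your key and search engine ID, then read the `link` field of each entry in `items`. The function names, the `KEY`/`CX` placeholders, and the sample response below are made up for illustration, and the response is parsed offline so the snippet runs without credentials.

```php
<?php
// Hypothetical helper: builds a request URL for the Custom Search JSON API.
// $apiKey and $engineId are placeholders for credentials you would get from Google.
function buildSearchUrl(string $term, string $apiKey, string $engineId, int $limit = 5): string
{
    return 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
        'key' => $apiKey,
        'cx'  => $engineId,
        'q'   => $term,
        'num' => $limit,
    ], '', '&');
}

// Hypothetical helper: pulls the result URLs out of a JSON response body.
// Assumes each entry in "items" carries a "link" field.
function extractLinks(string $json, int $limit = 5): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['items']) || !is_array($data['items'])) {
        return []; // Invalid JSON or no results.
    }
    return array_slice(array_column($data['items'], 'link'), 0, $limit);
}

// Offline demonstration with a hand-written response body.
$sample = '{"items":[{"link":"https://example.com/a"},{"link":"https://example.org/b"}]}';
print_r(extractLinks($sample));
```

For a live lookup you would pass the body returned by something like `file_get_contents(buildSearchUrl("test", $apiKey, $engineId))` to `extractLinks()`, after checking that the request didn't fail.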

The second way is to use a scraper program to visit the search page, enter a search term, and submit the associated form. Since you've specified PHP, I'd recommend Goutte. Internally it uses Guzzle and Symfony Components, so it must be good! The Goutte README shows how easy it is. HTML fragments are selected using either XPath or CSS, so it's flexible too.
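To give a feel for XPath-based selection without installing anything, here is a minimal sketch that uses PHP's built-in DOM extension instead of Goutte. The `extractTopLinks()` helper, the sample markup, and the `result` class name are all hypothetical; a real results page will need its own XPath expression, and that expression will break whenever the page layout changes.

```php
<?php
// Hypothetical helper: returns the href of the first $limit anchors matched by $xpath.
function extractTopLinks(string $html, string $xpath, int $limit = 5): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $finder = new DOMXPath($doc);
    $links = [];
    foreach ($finder->query($xpath) as $anchor) {
        $links[] = $anchor->getAttribute('href');
        if (count($links) >= $limit) {
            break; // Stop once enough URLs have been collected.
        }
    }
    return $links;
}

// Made-up markup standing in for a downloaded results page.
$html = '<html><body>'
      . '<div class="result"><h2><a href="https://example.com/1">One</a></h2></div>'
      . '<div class="result"><h2><a href="https://example.com/2">Two</a></h2></div>'
      . '<a href="https://example.com/ad">Unrelated link</a>'
      . '</body></html>';

print_r(extractTopLinks($html, '//div[@class="result"]//h2/a'));
```

Scoping the expression to the result container keeps navigation and advert links out of the list, which is the main weakness of a regex that matches every anchor on the page.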

Lastly, given the low volume of required scrapes, consider downloading a free software package from Import.io. This lets you build a scraper using a point-and-click interface, and it learns how to scrape various areas of the page before storing the data in a local or cloud database.

OTHER TIPS

You can also use a third-party service like Serp Api to get Google results.

It should be pretty easy to integrate:

$query = [
    "q" => "Coffee",
    "google_domain" => "google.com",
];

$serp = new GoogleSearchResults();
$json_results = $serp->json($query);

GitHub project.

Licensed under: CC-BY-SA with attribution