Question

Consider the following Python code:

    import urllib.request

    url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
    url_object = urllib.request.urlopen(url)
    print(url_object.read())

When this is run, an exception is raised:

File "/usr/local/lib/python3.0/urllib/request.py", line 485, in http_error_default
   raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

However, when this is put into a browser, the search returns as expected. What's going on here? How can I overcome this so I can search Google programmatically?

Any thoughts?


Solution

If you want to do Google searches "properly" through a programming interface, take a look at Google APIs. Not only are these the official way of searching Google, they are also not likely to change if Google changes their result page layout.

OTHER TIPS

This should do the trick:

import urllib.request

# Send a browser-like User-Agent instead of urllib's default one.
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
headers = {'User-Agent': user_agent}

request = urllib.request.Request(url, None, headers)  # the assembled request
response = urllib.request.urlopen(request)
data = response.read()  # the response body, as bytes
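
If the request is still refused, it can help to look at the status code and the response body instead of letting the exception propagate. Here is a minimal sketch for Python 3, assuming only the standard urllib package. The names build_request and fetch, and the injectable opener parameter, are hypothetical helpers for illustration, not part of any library.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Return a Request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, opener=urllib.request.urlopen):
    """Fetch url and return (status, body).

    HTTP error responses such as 403 are returned as (code, body)
    rather than raised, so the caller can inspect them.
    `opener` is a hypothetical hook that defaults to urlopen.
    """
    request = build_request(url, user_agent)
    try:
        with opener(request) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # HTTPError is file-like, so the error page can be read too.
        return err.code, err.read()
```

Calling fetch(url, user_agent) and checking whether the returned status is 403 tells you whether the server refused the request, and the returned body usually says why.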

As lacqui suggested, the Google APIs are the way they want you to make requests from code. Unfortunately, I found their documentation was aimed at people writing AJAX web pages, not people making raw HTTP requests. I used LiveHTTP Headers to trace the HTTP requests that the samples made, and I found ddipaolo's blog post helpful.

One more thing that messed me up: they limit you to the first 64 results from a query. Usually not a problem if you are just providing web users with a search box, but not helpful if you're trying to use Google to go data mining. I guess they don't want you to go data mining using their API. That 64 number has changed over time and varies between search products.

Update: It appears they definitely do not want you to go data mining. Eventually, you get a 403 error with a link to this API access notice.

Please review the Terms of Use for the API(s) you are using (linked in the right sidebar) and ensure compliance. It is likely that we blocked you for one of the following Terms of Use violations: We received automated requests, such as scraping and prefetching. Automated requests are prohibited; all requests must be made as a result of an end-user action.

They also list other violations, but I think that's the one that triggered for me. I may have to investigate Yahoo's BOSS service. It doesn't seem to have as many restrictions.

You're probably sending requests too often. Google has limits in place to keep search bots from swamping it. You can also try setting the user-agent to something that more closely resembles a normal browser.
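
One way to space requests out is a small throttle that sleeps between calls. This is only a sketch: the Throttle class is a hypothetical helper, and any interval you pass in is your own guess, not a documented Google limit. The clock and sleep parameters exist only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to honor min_interval; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self._min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

Create one Throttle and call its wait() method immediately before each request. The first call returns at once, and later calls pause only if the previous one was too recent.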

Licensed under: CC-BY-SA with attribution