Question

I'm trying to scrape product listing pages that show the vendors and prices for a particular product, but urllib.urlopen isn't working. It works on every other Amazon page I've tried, so I suspect Amazon blocks bots from scraping product listing pages. Can anyone verify this? In Chrome I can still view the page source.

Here's an example of a product listing page I would want to scrape: http://www.amazon.com/gp/offer-listing/B007E84H96/ref=dp_olp_new?ie=UTF8&condition=new


Solution

Trying curl -I on that URL (which sends a HEAD request) returns 405 MethodNotAllowed:

$ curl -I 'http://www.amazon.com/gp/offer-listing/B007E84H96/ref=dp_olp_new?ie=UTF8&condition=new' 
HTTP/1.1 405 MethodNotAllowed
Date: Wed, 13 Feb 2013 16:41:08 GMT
Server: Server
x-amz-id-1: 1WKZG9N0SE87E3KFG6YV
allow: POST, GET
x-amz-id-2: Apluv2QBzzrmXlRWjlClRGsQQ1TbwsxObe2hxfdrGhO/OQziI/aIT3vkVjCPn+qz
Vary: Accept-Encoding,User-Agent
Content-Type: text/html; charset=ISO-8859-1

and adding a User-Agent string with the -A switch didn't affect that response.

You might experiment with different HTTP headers to see if you can find a combination that passes. But it's pretty obvious that Amazon wouldn't want you to screen-scrape prices from their product pages, and a little googling brings up this page:

http://www.distil.it/amazon-cracks-down-on-price-scraping/#.URvBFo4ry0s

With no fanfare or warning, Amazon in June began enforcing a long-standing policy prohibiting screen-scraping tools from harvesting listing information directly from its marketplace, a favorite tool for providers of repricing services for merchants, according to a third-party developer.

Note also that Amazon has an API for their affiliates -- there are some related questions about using that API from Python in the "Related" question links in the right column.
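
If you do want to try the header experiment mentioned above, here is a rough sketch using Python 3's urllib.request (the question's urllib.urlopen is the Python 2 spelling). The header values in CANDIDATE_HEADERS and the helper names build_request and probe are assumptions made up for illustration; nothing here is known to get past Amazon's checks, and scraping may still violate their policy. The point is only to compare status codes for HEAD versus GET and for different header sets.

```python
import urllib.error
import urllib.request

# Hypothetical header set to experiment with; these values are guesses,
# not something Amazon is known to accept.
CANDIDATE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(url, headers=None, method="GET"):
    """Build a request with an explicit HTTP method and header set."""
    return urllib.request.Request(url, headers=headers or {}, method=method)


def probe(url, headers=None, method="GET", timeout=10):
    """Send the request and return (status code, response headers).

    HTTP error responses such as 405 are returned rather than raised,
    so different methods and header sets can be compared side by side.
    """
    request = build_request(url, headers, method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers.items())
    except urllib.error.HTTPError as err:
        return err.code, dict(err.headers.items())


if __name__ == "__main__":
    url = ("http://www.amazon.com/gp/offer-listing/B007E84H96/"
           "ref=dp_olp_new?ie=UTF8&condition=new")
    # curl -I sends HEAD; compare it against a plain GET.
    for method in ("HEAD", "GET"):
        status, _ = probe(url, CANDIDATE_HEADERS, method=method)
        print(method, status)
```

Since the 405 response above lists "allow: POST, GET", the HEAD result on its own doesn't tell you how a GET from a script will be treated, which is why the sketch checks both.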

OTHER TIPS

Have you heard of BeautifulSoup? You might get some mileage out of that...

http://www.crummy.com/software/BeautifulSoup/


More details: BeautifulSoup Grab Visible Webpage Text

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow