Question

I am using BeautifulSoup for web scraping and I am having problems with a particular type of website when using urlopen. Every item on the website has its own unique page, and each item comes in different formats (e.g. 500 mL, 1L, 2L, ...).

When I open the URL of the product (www.example.com/product1) in my browser, I see a picture of the 500 mL format, information about it (price, quantity, flavor, etc.) and a list of all the other formats available for this item. If I click on another format (e.g. 1L), the picture and the information change, but the URL in the address bar stays the same (www.example.com/product1). However, I know from inspecting the HTML of the page that each format has its own unique URL (500 mL: www.example.com/product1/123; 1L: www.example.com/product1/456, ...). When I enter the unique URL of the 1L format in my browser, I am automatically redirected to www.example.com/product1, but the picture and the information displayed on the page correspond to the 1L format. The HTML also contains the information that I need about the 1L format.

My problem arises when I use urlopen to open these unique URLs.

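from bs4 import BeautifulSoup
from urllib import urlopen

# urlopen needs the scheme; without "http://" the string is treated as a local file path
webpage = urlopen('http://www.example.com/product1/456')
soup = BeautifulSoup(webpage, 'html.parser')
print soup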

The information contained in the soup does not correspond to the information displayed using my Internet Browser for the unique URL: www.example.com/product1/456. It gives me the information about the item format displayed by default on www.example.com/product1 which is always the 500 mL format.

Is there any way to prevent this redirection, so that I can capture with BeautifulSoup the information contained in the HTML of the unique URLs?

Solution

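urlopen follows HTTP redirects automatically. To stop that, build an opener with a redirect handler that returns the 3xx response instead of following it:

import urllib2

# Return 3xx responses to the caller instead of following them.
class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        # Wrap the redirect response so its body (fp) and headers stay readable.
        result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result
    # Handle the other redirect codes the same way.
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib2.build_opener(RedirectHandler())
webpage = opener.open('http://www.example.com/product1/456')
...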
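
For anyone on Python 3, where urllib2 was split into urllib.request and urllib.error, the same idea can be sketched roughly as follows. This is only a port of the handler above under that assumption, not a tested recipe for any particular site; the names NoRedirectHandler and open_without_redirect are made up for illustration.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hand 3xx responses back to the caller instead of following them."""

    def http_error_302(self, req, fp, code, msg, headers):
        # HTTPError doubles as a readable response: .code, .headers, .read()
        return urllib.error.HTTPError(req.get_full_url(), code, msg, headers, fp)

    # Treat the other redirect status codes the same way.
    http_error_301 = http_error_303 = http_error_307 = http_error_308 = http_error_302


def open_without_redirect(url):
    """Open `url` and return the first response, even if it is a redirect.

    Hypothetical helper: the name and signature are illustrative only.
    """
    opener = urllib.request.build_opener(NoRedirectHandler())
    return opener.open(url)
```

The returned object exposes the status as `.code`, the redirect target as `.headers["Location"]`, and the body through `.read()`, which can be passed to BeautifulSoup as before. Whether the body of the redirect response actually contains the format-specific HTML depends on the site, so it is worth printing the status and headers first.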
Licensed under: CC-BY-SA with attribution