Preventing a "hidden" redirect with urlopen() in Python

https://stackoverflow.com/questions/16974321

31-05-2022

Question

I am using BeautifulSoup for web scraping and I am having problems with a particular type of website when using urlopen. Every item on the website has its own unique page, and the item comes in different formats (e.g. 500 mL, 1L, 2L, ...).

When I open the URL of the product (www.example.com/product1) in my browser, I see a picture of the 500 mL format, information about it (price, quantity, flavor, etc.) and a list of all the other formats available for this specific item. If I click on another format (e.g. 1L), the picture and the information about the item change, but the URL at the top of my browser stays the same (www.example.com/product1). However, I know from inspecting the HTML code of the page that all the formats have their own unique URL (500 mL: www.example.com/product1/123; 1L: www.example.com/product1/456, ...). When I enter the unique URL of the 1L format in my browser, I am automatically redirected to www.example.com/product1, but the picture and the information displayed on the page correspond to the 1L format. The HTML code also contains the information that I need about the 1L format.

My problem arises when I use urlopen to open these unique URLs.

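from bs4 import BeautifulSoup
from urllib import urlopen  # Python 2

# urlopen() needs the scheme; without "http://" the string is treated as a local file path
webpage = urlopen('http://www.example.com/product1/456')
soup = BeautifulSoup(webpage, 'html.parser')
print soup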

The information contained in the soup does not correspond to the information displayed using my Internet Browser for the unique URL: www.example.com/product1/456. It gives me the information about the item format displayed by default on www.example.com/product1 which is always the 500 mL format.

Is there any way I can prevent this redirection that would allow me to capture with BeautifulSoup the information contained in the HTML code of the unique URLs?

Answer

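Subclass urllib2.HTTPRedirectHandler so that 3xx responses are returned to the caller instead of being followed (Python 2):

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Return the 3xx response itself instead of following its Location header.
    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result
    # Treat the other redirect status codes the same way.
    http_error_301 = http_error_303 = http_error_307 = http_error_302

# A handler instance passed to build_opener() replaces the default redirect handler.
opener = urllib2.build_opener(RedirectHandler())
webpage = opener.open('http://www.example.com/product1/456')
...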
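
For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same idea could be sketched roughly as follows. The names NoRedirectHandler and build_no_redirect_opener are illustrative and do not come from the original answer. The sketch also assumes the site issues a real HTTP 3xx redirect; if the page is swapped client-side by JavaScript, suppressing redirects will not change what the server returns.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical Python 3 counterpart of the RedirectHandler above."""

    def http_error_302(self, req, fp, code, msg, headers):
        # Hand back the 3xx response itself instead of following Location.
        # HTTPError doubles as a response object: .code, .headers, .read().
        return urllib.error.HTTPError(req.full_url, code, msg, headers, fp)

    # Apply the same behaviour to the other redirect status codes.
    http_error_301 = http_error_303 = http_error_307 = http_error_308 = http_error_302


def build_no_redirect_opener():
    # A handler instance passed to build_opener() takes the place of the
    # default handler it subclasses, so redirects are no longer followed.
    return urllib.request.build_opener(NoRedirectHandler())
```

Calling build_no_redirect_opener().open('http://www.example.com/product1/456') should then return the redirect response itself. Its .code and .headers.get('Location') can be inspected, and the object can be passed to BeautifulSoup in place of the usual urlopen() result, provided the server sends a body along with the redirect.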

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow