When I use python requests to check a site, if the site redirects me to another page, will I know?

StackOverflow https://stackoverflow.com/questions/13482777

  •  30-11-2021

Question

What I mean is: suppose I go to "www.yahoo.com/thispage", and Yahoo has set up a rule that redirects /thispage to /thatpage, so anyone who goes to /thispage lands on /thatpage.

If I use httplib/requests/urllib, will it know that there was a redirection? What about error pages? Some sites redirect the user to /errorpage whenever the page cannot be found.

Solution

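With requests, every redirect that was followed is recorded in the .history attribute of the response object. It is a Python list of Response objects, ordered from the oldest to the most recent; the URL you finally landed on is in the .url attribute. See the documentation for more.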
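As a rough sketch of how .history can be inspected, the helpers below walk a redirect chain hop by hop. The names redirect_chain and was_redirected are made up for this illustration; they rely only on the .history, .status_code, and .url attributes of a requests response.

```python
def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response.

    `redirect_chain` is a hypothetical helper name, not part of requests.
    """
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def was_redirected(response):
    """True if at least one HTTP redirect was followed (hypothetical helper)."""
    return len(response.history) > 0


# Usage (needs the third-party `requests` package and network access):
#   import requests
#   r = requests.get('http://www.yahoo.com/thispage')
#   if was_redirected(r):
#       for status, url in redirect_chain(r):
#           print(status, url)
```

An empty .history only means that no HTTP-level redirect was followed; it says nothing about redirects done inside the HTML page itself.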

OTHER TIPS

To prevent requests from following redirects use:

r = requests.get('http://www.yahoo.com/thispage', allow_redirects=False)

If it is indeed a redirect, you can check the redirect target location in r.headers['location'].
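A minimal sketch of that check, assuming the response was fetched with allow_redirects=False. The helper name redirect_target and the REDIRECT_CODES set are assumptions made for this example; the Location header may be relative, so it is resolved against the requested URL.

```python
from urllib.parse import urljoin

# Status codes that signal an HTTP redirect carrying a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(response):
    """Return the absolute redirect target, or None if this is not a redirect.

    Hypothetical helper; expects a response fetched with allow_redirects=False.
    """
    if response.status_code not in REDIRECT_CODES:
        return None
    location = response.headers.get('location')
    if location is None:
        return None
    # Location may be a relative path, so resolve it against the requested URL.
    return urljoin(response.url, location)


# Usage (needs the third-party `requests` package and network access):
#   import requests
#   r = requests.get('http://www.yahoo.com/thispage', allow_redirects=False)
#   print(redirect_target(r))
```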

The accepted answer is the correct first option, but some sites redirect without an HTTP redirect status (with a meta tag, for example), and the page they finally serve often specifies a canonical link. In this example, let me request http://en.wikipedia.org/wiki/Google_Inc_Class_A from Wikipedia, which is a URL that redirects.

>> import requests
>> request = requests.get('http://en.wikipedia.org/wiki/Google_Inc_Class_A')

I check and:

>> request.history
[]

An alternative is to try to pull the canonical URL, which should hopefully contain what you've been redirected to. (Note that I'm using BeautifulSoup here as well.)

>> from bs4 import BeautifulSoup
>> soup = BeautifulSoup(request.content, 'html.parser')
>> canonical = soup.find('link', {'rel': 'canonical'})
>> canonical['href']
'http://en.wikipedia.org/wiki/Google'

This matches the URL you get redirected to in this particular case. To be clear, this is an ugly second option, but it is worth trying if all else fails.
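If BeautifulSoup is not available, roughly the same lookup can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not what the snippet above uses, and the names CanonicalLinkParser and find_canonical are invented for this example.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != 'link' or self.canonical is not None:
            return
        attrs = dict(attrs)
        # rel may hold several space-separated tokens, so split before comparing.
        rel_tokens = (attrs.get('rel') or '').lower().split()
        if 'canonical' in rel_tokens:
            self.canonical = attrs.get('href')


def find_canonical(html):
    """Return the canonical URL declared in the HTML text, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


# Usage (needs the third-party `requests` package and network access):
#   import requests
#   r = requests.get('http://en.wikipedia.org/wiki/Google_Inc_Class_A')
#   print(find_canonical(r.text))
```

A canonical link is only a hint from the site about its preferred URL, so comparing it with the requested URL suggests a redirect rather than proving one.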

It depends on how they are doing the redirection. The "right" way is to return an HTTP redirect status code (such as 301, 302, or 303). The "wrong" way is to place a refresh meta tag in the HTML.

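If they do the former, requests will follow the redirect transparently. If they do the latter, requests will not notice, because it does not parse the HTML body. Note that any sane error page redirect will still end with an error status code (e.g. 404), which you can check as response.status_code.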
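For the meta tag case, one possible sketch is to look for a refresh meta tag in the returned HTML yourself. The names MetaRefreshParser and meta_refresh_target are hypothetical, and the regular expression only handles the common "0; url=/thatpage" form of the content attribute.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Record the target of the first <meta http-equiv="refresh"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get('http-equiv') or '').lower() != 'refresh':
            return
        # Typical content value: "0; url=/thatpage"
        content = attrs.get('content') or ''
        match = re.search(r'url\s*=\s*["\']?([^"\';]+)', content, re.IGNORECASE)
        if match:
            self.target = match.group(1).strip()


def meta_refresh_target(html, base_url):
    """Return the absolute meta refresh target, or None if the page has none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if parser.target is None:
        return None
    return urljoin(base_url, parser.target)


# Usage (needs the third-party `requests` package and network access):
#   import requests
#   r = requests.get('http://www.yahoo.com/thispage')
#   print(r.status_code, meta_refresh_target(r.text, r.url))
```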

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow