Question

I'm trying to read some content from a URL using Python, but I get a 404 every time I try.

Here is my test code, and the offending URL:

import urllib2

url = 'http://supercoach.heraldsun.com.au'

headers = {"User-agent": "Mozilla/5.0"}
req = urllib2.Request(url, None, headers)
try:
    handle = urllib2.urlopen(req)
except urllib2.HTTPError, e:  # only HTTPError carries a status code
    print e.code

The site works fine in a browser, and I have previously had no issues with this script, but a recent update to the site has caused it to fail.

I've tried adding a User-agent header, since answers to similar questions suggest it.

Any ideas why this isn't working?

Thanks, JP


Solution

Use requests, which provides a friendlier interface than urllib2 and follows redirects for you.

Your code with requests is simply:

import requests
r = requests.get('http://supercoach.heraldsun.com.au')
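
Since the answer relies on requests following redirects, it can help to see which hops were actually taken. The helper below is a minimal sketch in Python 3; `describe_redirects` is a hypothetical name, and it only assumes the `history`, `status_code` and `url` attributes that a requests response exposes.

```python
def describe_redirects(response):
    """Return [(status_code, url), ...] for every hop, final response last.

    `response` is expected to look like a requests.Response: `history`
    holds the intermediate redirect responses, oldest first.
    """
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Typical use (needs network access and the third-party requests package):
#   import requests
#   r = requests.get('http://supercoach.heraldsun.com.au')
#   for status, url in describe_redirects(r):
#       print(status, url)
#   r.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx final status
```

If the final status is still 404, the printed hops should at least show where the site sent the client before failing.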

Other tips

Try enabling cookie handling and raising the limits on allowed redirects, in case the site sets a cookie while redirecting:

import urllib2
from cookielib import CookieJar

class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Raise the stdlib limits (defaults: max_repeats = 4, max_redirections = 10).
    max_repeats = 100
    max_redirections = 1000

    def http_error_302(self, req, fp, code, msg, headers):
        print code
        print headers
        return urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)
    http_error_300 = http_error_302
    http_error_301 = http_error_302
    http_error_303 = http_error_302
    http_error_307 = http_error_302

cookiejar = CookieJar()
urlopen = urllib2.build_opener(RedirectHandler(),
                               urllib2.HTTPCookieProcessor(cookiejar)).open
request = urllib2.Request('http://supercoach.heraldsun.com.au',
                          headers={"User-agent": "Mozilla/5.0"})
response = urlopen(request)
print '*' * 60
print response.info()
print response.read()
response.close()
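
The code above is Python 2. On Python 3, `urllib2` and `cookielib` became `urllib.request` and `http.cookiejar`, so the same idea might look roughly like the sketch below. `LoggingRedirectHandler` and `build_cookie_opener` are hypothetical names chosen for illustration, and the limits are simply carried over from the answer.

```python
import urllib.request
from http.cookiejar import CookieJar


class LoggingRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Same limits as the Python 2 answer; the stdlib defaults are 4 and 10.
    max_repeats = 100
    max_redirections = 1000

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Called for each redirect response; log it, then defer to the stdlib.
        print(code, "->", newurl)
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def build_cookie_opener():
    """Return (opener, cookiejar); cookies set along the way are sent back."""
    cookiejar = CookieJar()
    opener = urllib.request.build_opener(
        LoggingRedirectHandler(),
        urllib.request.HTTPCookieProcessor(cookiejar),
    )
    # Replace the default "Python-urllib/x.y" User-agent header.
    opener.addheaders = [("User-agent", "Mozilla/5.0")]
    return opener, cookiejar


# Typical use (needs network access):
#   opener, jar = build_cookie_opener()
#   with opener.open('http://supercoach.heraldsun.com.au') as response:
#       print(response.status)
#       print(response.read())
```

Passing an instance of the subclass to `build_opener` replaces the default redirect handler, so only the raised limits apply.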
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow