Question

Using Python 2.6.6 on CentOS 6.4

import urllib
#url = 'http://www.google.com.hk'    #ok
#url = 'http://clients1.google.com.hk'    #ok
#url = 'http://clients1.google.com.hk/complete/search'  #ok (blank)
url  = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc'  #fails
print url
page = urllib.urlopen(url).read()
print page

Using the first 3 URLs, the code works. But with the 4th URL, Python gives the following 302:

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://clients1.google.com.hk/complete/search?output=toolbar&amp;hl=zh-CN&amp;q=abc">here</A>.
</BODY></HTML>

The URL in my code is the same as the URL it tells me to use:

My URL:  http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Its URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc

Google says URL moved, but the URLs are the same. Any ideas why?

Update: All of the URLs work fine in a browser. From the Python command line, however, the 4th URL returns a 302.


Solution

urllib ignores cookies, so it follows the redirect without sending back the cookie the server just set. The server then redirects again, which produces a redirect loop at that URL. To handle this, use urllib2 (which is more up to date) and add a cookie handler:

import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open('http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc')
print response.read()
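
For readers on Python 3, where urllib2 was folded into urllib.request, the same idea can be sketched roughly as below. This is a minimal sketch, not a tested fix for this endpoint: the helper names build_cookie_opener and fetch are made up for illustration, and whether the server still behaves as described in the question is an assumption.

```python
import http.cookiejar
import urllib.request

URL = "http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc"


def build_cookie_opener():
    """Return (opener, jar). The opener stores cookies in jar and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(url=URL, timeout=10):
    """Fetch url with cookie handling; return the body and the cookie names seen."""
    opener, jar = build_cookie_opener()
    # Redirects are followed by the opener; cookies set along the way
    # are sent back on the follow-up requests.
    with opener.open(url, timeout=timeout) as response:
        body = response.read()
    return body, [cookie.name for cookie in jar]
```

Calling fetch() performs the network request. Keeping a reference to the jar makes it possible to check which cookies the server set during the redirect.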

Other tips

It most likely has to do with the headers and perhaps cookies. I did a quick test on the command-line using curl. It also gives me the 302 moved. The Location header it provides is different, as is the one in the document. If I follow the body URL I get a 204 response (weird). If I follow the Location header I end up getting a circular response like you indicate.

Perhaps important is the Set-Cookie header. It may be redirecting until it gets an appropriate cookie set. It may also be scanning the User-Agent and doing something based on that. Those are the big aspects that differentiate a browser from a tool like requests or urllib. The browser creates sessions, stores cookies, and sends different headers.
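
If the User-Agent turns out to matter, one way to test that guess is to send a browser-like header explicitly. The sketch below uses Python 3's urllib.request; the helper name make_request and the USER_AGENT string are placeholders invented for illustration, and nothing here confirms that the server actually checks this header.

```python
import urllib.request

URL = "http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc"

# Placeholder value for illustration only; not taken from the question.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) example-client"


def make_request(url=URL, user_agent=USER_AGENT):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

The resulting request object can be passed to urllib.request.urlopen or to an opener that also handles cookies, so both differences from a browser can be tried together.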

I don't know why urllib fails (I get the same response); however, the requests library works perfectly:

import requests
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc'    # fails with urllib, works with requests
print(requests.get(url).text)

If you use your favorite web debugger (Fiddler for me) and open up that URL in your browser, you'll see that you also get that initial 302 response. Your browser is just smart enough to redirect you automatically. So your code is returning the correct response. If you want your code to redirect to the new URL automatically, then you have to make your code smart enough to do so.
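
To look at that initial 302 from Python rather than from a web debugger, redirect following can be switched off so the first response is reported as is. This is a rough Python 3 sketch; NoRedirectHandler, build_no_redirect_opener and open_without_redirect are names invented here, not part of any library.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that declines to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to build a follow-up request,
        # so the 3xx response surfaces as an HTTPError instead.
        return None


def build_no_redirect_opener():
    """Return an opener whose redirect handling is disabled."""
    return urllib.request.build_opener(NoRedirectHandler())


def open_without_redirect(url, timeout=10):
    """Return (status, Location header, list of Set-Cookie headers)."""
    opener = build_no_redirect_opener()
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status, None, None
    except urllib.error.HTTPError as err:
        headers = err.headers
        return err.code, headers.get("Location"), headers.get_all("Set-Cookie")
```

Calling open_without_redirect with the URL from the question should show the status code together with the Location and Set-Cookie headers, which is the information needed to tell a plain redirect from a cookie-driven loop.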

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow