Python: Get HTTP headers from a urllib2.urlopen call?

https://stackoverflow.com/questions/843392

20-08-2019

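Question

Does urllib2 fetch the whole page when a urlopen call is made?

I'd like to read just the HTTP response headers without getting the page. It looks like urllib2 opens the HTTP connection and then fetches the actual HTML page afterwards. Or does it already start fetching the page in the urlopen call?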

import urllib2
myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/'
page = urllib2.urlopen(myurl)  # open connection, get headers

html = page.readlines()  # stream page

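Solution

Use the response.info() method to get the headers.

From the urllib2 docs:

urllib2.urlopen(url[, data][, timeout])

...

This function returns a file-like object with two additional methods:

geturl() - returns the URL of the resource retrieved, commonly used to determine whether a redirect was followed.

info() - returns the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers).

So, for your example, try stepping through the result of response.info().headers for what you're looking for.

Note that the major caveat to using httplib.HTTPMessage is documented in Python issue 4773.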
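For readers on Python 3, where urllib2 became urllib.request, the same idea can be sketched roughly as below. This is a minimal illustration and not the answer author's code: the local DemoHandler server and the fetch_headers helper are hypothetical names, made up so the example runs without network access. Note that it still sends a GET; it only skips reading the body.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler standing in for a real web page."""

    def do_GET(self):
        body = b"<html>hello</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_headers(url):
    """Open the URL and return (final_url, headers) without reading the body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        # info() returns a message object; items() lists (name, value) pairs.
        return response.geturl(), dict(response.info().items())


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

final_url, headers = fetch_headers("http://127.0.0.1:%d/" % server.server_address[1])
print(final_url)
for name, value in headers.items():
    print("%s: %s" % (name, value))

server.shutdown()
server.server_close()
```

In Python 3 the object returned by info() has no .headers list, so the sketch uses items() instead.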

Other tips

What about sending a HEAD request instead of a normal GET request? The following snippet (copied from a similar question) does exactly that.

>>> import httplib
>>> conn = httplib.HTTPConnection("www.google.com")
>>> conn.request("HEAD", "/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK
>>> print res.getheaders()
[('content-length', '0'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Sat, 20 Sep 2008 06:43:36 GMT'), ('content-type', 'text/html; charset=ISO-8859-1')]

Actually, it appears that urllib2 can do an HTTP HEAD request.

The question that @reto linked to above shows how to get urllib2 to do a HEAD request.

Here's my take on it:

import urllib2

# Derive from Request class and override get_method to allow a HEAD request.
class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

myurl = 'http://bit.ly/doFeT'
request = HeadRequest(myurl)

try:
    response = urllib2.urlopen(request)
    response_headers = response.info()

    # Print all the header key-value pairs.  Replace this line with
    # something useful.
    print(response_headers.dict)

except urllib2.HTTPError as e:
    # Prints the HTTP status code of the response, but only if there was a
    # problem.
    print("Error code: %s" % e.code)

If you check this with something like the Wireshark network protocol analyzer, you can see that it is actually sending a HEAD request rather than a GET.

This is the HTTP request and response from the code above, as captured by Wireshark:

HEAD /doFeT HTTP/1.1
Accept-Encoding: identity
Host: bit.ly
Connection: close
User-Agent: Python-urllib/2.7

HTTP/1.1 301 Moved
Server: nginx
Date: Sun, 19 Feb 2012 13:20:56 GMT
Content-Type: text/html; charset=utf-8
Cache-control: private; max-age=90
Location: http://www.kidsidebyside.org/?p=445
MIME-Version: 1.0
Content-Length: 127
Connection: close
Set-Cookie: _bit=4f40f738-00153-02ed0-421cf10a; domain=.bit.ly; expires=Aug 17 13:20:56 2012; path=/; HttpOnly

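However, as mentioned in one of the comments on the other question, if the URL in question involves a redirect, urllib2 will do a GET request to the destination, not a HEAD. This could be a major shortcoming if you really want to make only HEAD requests.

The request above involves a redirect. Here is the request to the destination, as captured by Wireshark:

GET /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Accept-Encoding: identity
Host: www.kidsidebyside.org
Connection: close
User-Agent: Python-urllib/2.7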
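One possible way around the redirect issue is to stop the library from following redirects at all, so that only the single HEAD request goes out and the Location header can be inspected by hand. The sketch below assumes Python 3 (urllib.request rather than urllib2). The NoRedirect handler, the head_no_follow helper and the local RedirectingHandler server are hypothetical names introduced only for this illustration.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_methods = []  # request methods the demo server actually received


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical local stand-in for a URL shortener: always answers 301."""

    def do_HEAD(self):
        seen_methods.append(self.command)
        self.send_response(301)
        self.send_header("Location", "/target")
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_GET = do_HEAD  # a GET would be recorded too, if one were ever sent

    def log_message(self, *args):
        pass  # keep the demo output quiet


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that declines to build a follow-up request."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def head_no_follow(url):
    """Send one HEAD request; return (status, headers) without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=5) as response:
            return response.status, dict(response.info().items())
    except urllib.error.HTTPError as e:
        # With redirects declined, a 3xx response surfaces as an HTTPError
        # that still carries the response headers.
        result = (e.code, dict(e.headers.items()))
        e.close()
        return result


server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

status, headers = head_no_follow("http://127.0.0.1:%d/short" % server.server_address[1])
print(status, headers.get("Location"))
print(seen_methods)

server.shutdown()
server.server_close()
```

The trade-off is that the caller has to resolve the redirect chain manually, for example by calling the helper again on the Location value.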

An alternative to using urllib2 is Joe Gregorio's httplib2 library:

import httplib2

url = "http://bit.ly/doFeT"
http_interface = httplib2.Http()

try:
    response, content = http_interface.request(url, method="HEAD")
    print ("Response status: %d - %s" % (response.status, response.reason))

    # Print all the response key-value pairs.  Replace this line with
    # something useful.
    print(response.__dict__)

except httplib2.ServerNotFoundError as e:
    print(e.message)

This has the advantage of using HEAD requests for both the initial HTTP request and the redirected request to the destination URL.

Here is the first request:

HEAD /doFeT HTTP/1.1
Host: bit.ly
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)

Here is the second request, to the destination:

HEAD /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Host: www.kidsidebyside.org
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)

urllib2.urlopen does an HTTP GET (or a POST if you supply a data argument), not an HTTP HEAD. If it did the latter, you could not call readlines or otherwise access the page body.
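The difference between the two methods can be seen in a small experiment. The sketch below assumes Python 3, where httplib is named http.client, and talks to a hypothetical local PageHandler server rather than a real site; the fetch helper is likewise just an illustrative name.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html>hello</html>"


class PageHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that serves one small page."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()

    def do_HEAD(self):
        self._send_headers()  # headers only, no body

    def do_GET(self):
        self._send_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(method, host, port, path="/"):
    """Return (status, Content-Length header, body bytes) for one request."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request(method, path)
        res = conn.getresponse()
        return res.status, res.getheader("Content-Length"), res.read()
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), PageHandler)
host, port = server.server_address
threading.Thread(target=server.serve_forever, daemon=True).start()

head_result = fetch("HEAD", host, port)
get_result = fetch("GET", host, port)
print(head_result)
print(get_result)

server.shutdown()
server.server_close()
```

Both responses carry the same headers, but only the GET response has a body to read.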

One-liner:

$ python -c "import urllib2; print urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1)).open(urllib2.Request('http://google.com'))"

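import urllib2

def _GetHtmlPage(self, addr):
  # Send the configured User-Agent and cookies along with the request.
  headers = { 'User-Agent' : self.userAgent,
              'Cookie' : self.cookies}

  req = urllib2.Request(addr, headers=headers)
  response = urllib2.urlopen(req)

  # response.info() holds the HTTP response headers.
  print "ResponseInfo="
  print response.info()

  resultsHtml = unicode(response.read(), self.encoding)
  return resultsHtml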

License: CC-BY-SA with attribution

Not affiliated with StackOverflow