urllib2 file name

https://stackoverflow.com/questions/163009

03-07-2019

Question

If I open a file using urllib2, like so:

remotefile = urllib2.urlopen('http://example.com/somefile.zip')

Is there an easy way to get the file name other than parsing the original URL?

EDIT: changed openfile to urlopen... not sure how that happened.

EDIT2: I ended up using:

filename = url.split('/')[-1].split('#')[0].split('?')[0]

Unless I'm mistaken, this should strip out any potential query string as well.

Solution

Did you mean urllib2.urlopen?

You could potentially lift the intended file name if the server sent a Content-Disposition header, by checking remotefile.info()['Content-Disposition'], but as it stands I think you'll just have to parse the URL.
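Note that the header value is not a bare file name; it usually looks like attachment; filename="somefile.zip". A minimal sketch of pulling the name out of such a value follows. It assumes Python 3 (the thread itself uses Python 2), and the helper name filename_from_content_disposition is made up for illustration:

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    # Let the stdlib email parser handle quoting instead of splitting by hand.
    msg = Message()
    msg['Content-Disposition'] = header_value
    return msg.get_filename()


print(filename_from_content_disposition('attachment; filename="somefile.zip"'))
print(filename_from_content_disposition('inline'))
```

Since the name comes from the server, it is probably wise to pass it through os.path.basename before using it as a local path.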

You could use urlparse.urlsplit, but if you have any URLs like the second example, you'll end up having to pull the file name out of the path yourself anyway:

>>> urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')
>>> urlparse.urlsplit('http://example.com/somedir/somefile.zip')
('http', 'example.com', '/somedir/somefile.zip', '', '')

You might as well just do this:

>>> 'http://example.com/somefile.zip'.split('/')[-1]
'somefile.zip'
>>> 'http://example.com/somedir/somefile.zip'.split('/')[-1]
'somefile.zip'

Other tips

If you only want the file name itself, and assuming there are no query variables at the end like http://example.com/somedir/somefile.zip?foo=bar, then you can use os.path.basename for this:

[user@host]$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04) 
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.basename("http://example.com/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip?foo=bar")
'somefile.zip?foo=bar'

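Some other posters mentioned using urlparse, which will work, but you'd still need to strip the leading directory from the file name. If you use os.path.basename() you don't have to worry about that, since it returns only the final part of the URL or file path.

I think that "the file name" isn't a very well-defined concept when it comes to HTTP transfers. The server might (but is not required to) provide one in a "Content-Disposition" header, which you can try to read with remotefile.headers['Content-Disposition']. If that fails, you probably have to parse the URI yourself.

Just saw this. What I normally do: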

filename = url.split("?")[0].split("/")[-1]

Using urlsplit is the safest option:

url = 'http://example.com/somefile.zip'
urlparse.urlsplit(url).path.split('/')[-1]
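To see why splitting the URL first helps, here is a small sketch, assuming Python 3, where urlparse lives in urllib.parse. The helper name filename_from_url is hypothetical. Because only the path component is inspected, a query string or fragment cannot leak into the result:

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url):
    """Return the last path segment of url, ignoring query and fragment."""
    path = urlsplit(url).path
    # URL paths always use '/', so posixpath is used regardless of platform.
    return unquote(posixpath.basename(path))


print(filename_from_url('http://example.com/somedir/somefile.zip?foo=bar#top'))
print(filename_from_url('http://example.com/my%20file.zip'))
```

A URL that ends in a slash yields an empty string here, so callers may want a fallback name for that case.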

Do you mean urllib2.urlopen? There is no function called openfile in the urllib2 module.

Anyway, use the urllib2.urlparse functions:

>>> from urllib2 import urlparse
>>> print urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')

Voilà.

You could also combine the two approaches: use urllib2.urlparse.urlsplit() to get the path part of the URL, then os.path.basename for the actual file name.

The full code would be:

>>> remotefile = urllib2.urlopen(url)
>>> try:
...     # Raw header value such as 'attachment; filename="x.zip"', not a bare name
...     filename = remotefile.info()['Content-Disposition']
... except KeyError:
...     filename = os.path.basename(urllib2.urlparse.urlsplit(url).path)

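The os.path.basename function works not only on file paths but also on URLs, so you don't have to parse the URL manually. It's also important to note that you should use result.url instead of the original URL, so that redirects are taken into account: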

import os
import urllib2
result = urllib2.urlopen(url)
real_url = urllib2.urlparse.urlparse(result.url)
filename = os.path.basename(real_url.path)
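Putting the pieces from this thread together, a rough sketch of a fallback chain might look like the following. It assumes Python 3, where the object returned by urllib.request.urlopen exposes .headers (an email.message.Message subclass) and .url (the final URL after redirects). The function name filename_from_response is made up for illustration:

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_response(resp):
    """Prefer the Content-Disposition name, else the final URL's last segment."""
    # get_filename() returns None when the header or parameter is missing.
    name = resp.headers.get_filename()
    if name:
        # Drop any directory part a server might have included.
        return posixpath.basename(name)
    # resp.url is the URL actually fetched, i.e. after any redirects.
    return unquote(posixpath.basename(urlsplit(resp.url).path))
```

Usage would be something like filename_from_response(urllib.request.urlopen(url)); the result can still be empty if the final URL ends in a slash.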

I guess it depends on what you mean by parsing. There is no way to get the file name without parsing the URL, i.e. the remote server doesn't give you a file name. However, you don't have to do much yourself; there's the urlparse module:

In [9]: urlparse.urlparse('http://example.com/somefile.zip')
Out[9]: ('http', 'example.com', '/somefile.zip', '', '', '')

Not that I know of.

But you can parse it easily enough like this:

url = 'http://example.com/somefile.zip'
print url.split('/')[-1]

This uses requests, but you can do the same easily with urllib(2):

import requests
from urllib import unquote
from urlparse import urlparse

sample = requests.get(url)
filename = False  # assumed default; the original snippet never defines it

if sample.status_code == 200:
    # has_key does not work on the headers object, so use `in` instead
    if filename == False:
        if 'content-disposition' in sample.headers:
            filename = sample.headers['content-disposition'].split('filename=')[-1].replace('"','').replace(';','')
        else:
            # Try the last value in the query string first
            filename = urlparse(sample.url).query.split('/')[-1].split('=')[-1].split('&')[-1]
            if not filename:
                # Fall back to the last segment of the final URL
                if url.split('/')[-1] != '':
                    filename = sample.url.split('/')[-1].split('=')[-1].split('&')[-1]
                    filename = unquote(filename)

You could use a simple regular expression here. Something like:

In [26]: import re
In [27]: pat = re.compile(r'.+[\/\?#=]([\w-]+\.[\w-]+(?:\.[\w-]+)?$)')
In [28]: test_set 

['http://www.google.com/a341.tar.gz',
 'http://www.google.com/a341.gz',
 'http://www.google.com/asdasd/aadssd.gz',
 'http://www.google.com/asdasd?aadssd.gz',
 'http://www.google.com/asdasd#blah.gz',
 'http://www.google.com/asdasd?filename=xxxbl.gz']

In [30]: for url in test_set:
   ....:     match = pat.match(url)
   ....:     if match and match.groups():
   ....:         print(match.groups()[0])
   ....:         

a341.tar.gz
a341.gz
aadssd.gz
aadssd.gz
blah.gz
xxxbl.gz

Using PurePosixPath, which is not operating-system dependent and handles URLs gracefully, is the Pythonic solution:

>>> from pathlib import PurePosixPath
>>> path = PurePosixPath('http://example.com/somefile.zip')
>>> path.name
'somefile.zip'
>>> path = PurePosixPath('http://example.com/nested/somefile.zip')
>>> path.name
'somefile.zip'

Notice that there is no network traffic here (i.e. those URLs don't go anywhere); this just uses standard path parsing rules.
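One caveat worth checking: PurePosixPath knows nothing about query strings, so a URL with a slash in its query can give a surprising result. A small sketch, assuming Python 3 and a made-up example URL, that isolates the path with urlsplit first:

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical URL whose query string itself contains a slash.
url = 'http://example.com/somedir/somefile.zip?next=/home'

# Treating the whole URL as a path picks up the tail of the query.
naive = PurePosixPath(url).name

# Isolating the path component first avoids that.
safer = PurePosixPath(urlsplit(url).path).name

print(naive)  # home
print(safer)  # somefile.zip
```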

import os,urllib2
resp = urllib2.urlopen('http://www.example.com/index.html')
my_url = resp.geturl()

os.path.split(my_url)[1]

# 'index.html'

This is not openfile, but maybe it still helps :)

License: CC-BY-SA with attribution

Not affiliated with StackOverflow