urllib2 file name

https://stackoverflow.com/questions/163009

03-07-2019

Question

If I open a file using urllib2, like this:

remotefile = urllib2.urlopen('http://example.com/somefile.zip')

Is there an easy way to get the file name other than parsing the original URL?

EDIT: changed openfile to urlopen... not sure how that happened.

EDIT2: I ended up using:

filename = url.split('/')[-1].split('#')[0].split('?')[0]

Unless I'm mistaken, this should strip out any potential query string as well.

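Solution

Did you mean urllib2.urlopen?

If the server sends a Content-Disposition header, you may be able to read the intended file name from remotefile.info()['Content-Disposition'], but as it stands I think you'll just have to parse the URL.

You could use urlparse.urlsplit, but if you have any URLs like the second example, you'll end up having to pull the file name out yourself anyway: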

>>> urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')
>>> urlparse.urlsplit('http://example.com/somedir/somefile.zip')
('http', 'example.com', '/somedir/somefile.zip', '', '')

You might as well just do this:

>>> 'http://example.com/somefile.zip'.split('/')[-1]
'somefile.zip'
>>> 'http://example.com/somedir/somefile.zip'.split('/')[-1]
'somefile.zip'

Other tips

As long as there are no query variables at the end, as in http://example.com/somedir/somefile.zip?foo=bar, you can use os.path.basename for this:

[user@host]$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04) 
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.basename("http://example.com/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip?foo=bar")
'somefile.zip?foo=bar'

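Some of the other posters mentioned using urlparse, which will work, but you'd still need to strip the leading directory from the file name. If you use os.path.basename() you don't have to worry about that, since it returns only the final part of the URL or file path.

The "file name" isn't a very well-defined concept when it comes to HTTP transfers. The server may (but is not required to) provide one in a Content-Disposition header; you can try to read it with remotefile.headers['Content-Disposition']. If that fails, you probably have to parse the URI yourself.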
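Note that the header value is usually something like attachment; filename="somefile.zip" rather than a bare name, so it still needs unpacking. A minimal sketch of one way to do that with the standard library's email.message.Message (Python 3 here; the helper name filename_from_content_disposition is made up for illustration):

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg['Content-Disposition'] = header_value
    # get_filename() looks up the "filename" parameter of Content-Disposition
    # and returns None when the parameter is absent.
    return msg.get_filename()


print(filename_from_content_disposition('attachment; filename="somefile.zip"'))
print(filename_from_content_disposition('inline'))
```

The first call should give somefile.zip and the second None, which is the cue to fall back to parsing the URL.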

This is what I normally do:

filename = url.split("?")[0].split("/")[-1]

Using urlsplit is the safest option:

url = 'http://example.com/somefile.zip'
urlparse.urlsplit(url).path.split('/')[-1]
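To see why urlsplit is the safer choice, here is a small sketch comparing it with the one-liner from the question on a URL whose query string contains a slash. It is written for Python 3, where urlsplit lives in urllib.parse; the helper names and the example URL are only for illustration.

```python
from urllib.parse import urlsplit


def naive_name(url):
    # The one-liner from the question: split on '/' first, then drop
    # the fragment and the query string.
    return url.split('/')[-1].split('#')[0].split('?')[0]


def urlsplit_name(url):
    # Separate the path from the query and fragment first, then take
    # the last path segment.
    return urlsplit(url).path.split('/')[-1]


tricky = 'http://example.com/somedir/somefile.zip?next=/other/page#top'
print(naive_name(tricky))     # picks up 'page' from inside the query string
print(urlsplit_name(tricky))  # 'somefile.zip'
```

Plain string splitting cannot tell a slash in the path from a slash in the query, whereas urlsplit separates the components before anything is split.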

Do you mean urllib2.urlopen? There is no function called openfile in the urllib2 module.

Anyway, use the urllib2.urlparse functions:

>>> from urllib2 import urlparse
>>> print urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')

Voilà.

You could also combine the two top-rated answers: use urllib2.urlparse.urlsplit() to get the path part of the URL, and then os.path.basename to get the actual file name.

The full code would be:

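import os
import urllib2

remotefile = urllib2.urlopen(url)
try:
    # Note: this is the raw header value, e.g. 'attachment; filename="somefile.zip"'
    filename = remotefile.info()['Content-Disposition']
except KeyError:
    # No such header: fall back to the last segment of the URL path
    filename = os.path.basename(urllib2.urlparse.urlsplit(url).path)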

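The os.path.basename function works not only on file paths but also on URLs, so you don't have to parse the URL manually. It's also important to note that you should use result.url instead of the original URL in order to follow redirect responses: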

import os
import urllib2
result = urllib2.urlopen(url)
real_url = urllib2.urlparse.urlparse(result.url)
filename = os.path.basename(real_url.path)
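For anyone on Python 3, where urllib2 was split into urllib.request and urllib.parse, a rough equivalent might look like the sketch below. The function name filename_from_final_url is hypothetical, and posixpath.basename is used instead of os.path.basename so that the result does not depend on the local operating system's path rules.

```python
import posixpath
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def filename_from_final_url(url):
    """Open the URL and derive a file name from the URL it finally resolved to."""
    with urlopen(url) as resp:
        # geturl() reports the URL actually retrieved, i.e. after any redirects.
        final_url = resp.geturl()
    # Keep only the path (no query or fragment), take its last segment,
    # and decode percent-escapes such as %20.
    return unquote(posixpath.basename(urlsplit(final_url).path))


# Example call (needs network access):
# filename_from_final_url('http://example.com/somefile.zip')
```

If the final path ends with a slash, this returns an empty string, so a caller would still need a default name for that case.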

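I guess it depends on what you mean by parsing. There is no way to get the file name without parsing the URL, i.e. the remote server doesn't give you a file name. However, you don't have to do much of it yourself; there's the urlparse module: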

In [9]: urlparse.urlparse('http://example.com/somefile.zip')
Out[9]: ('http', 'example.com', '/somefile.zip', '', '', '')

Not that I know of.

But you can parse it easily enough, like this:

url = 'http://example.com/somefile.zip'
print url.split('/')[-1]

This uses requests, but you can do it just as easily with urllib(2):

import requests
from urllib import unquote
from urlparse import urlparse

filename = False  # no file name chosen yet
sample = requests.get(url)

if sample.status_code == 200:
    # has_key does not work here; `in` on the headers mapping is
    # case-insensitive, which helps avoid problems with names
    if not filename:
        if 'content-disposition' in sample.headers:
            filename = sample.headers['content-disposition'].split('filename=')[-1].replace('"', '').replace(';', '')
        else:
            filename = urlparse(sample.url).query.split('/')[-1].split('=')[-1].split('&')[-1]
            if not filename:
                if url.split('/')[-1] != '':
                    filename = sample.url.split('/')[-1].split('=')[-1].split('&')[-1]
                    filename = unquote(filename)

You could probably use a simple regular expression here. Something like:

In [26]: import re
In [27]: pat = re.compile('.+[\/\?#=]([\w-]+\.[\w-]+(?:\.[\w-]+)?$)')
In [28]: test_set 

['http://www.google.com/a341.tar.gz',
 'http://www.google.com/a341.gz',
 'http://www.google.com/asdasd/aadssd.gz',
 'http://www.google.com/asdasd?aadssd.gz',
 'http://www.google.com/asdasd#blah.gz',
 'http://www.google.com/asdasd?filename=xxxbl.gz']

In [30]: for url in test_set:
   ....:     match = pat.match(url)
   ....:     if match and match.groups():
   ....:         print(match.groups()[0])
   ....:         

a341.tar.gz
a341.gz
aadssd.gz
aadssd.gz
blah.gz
xxxbl.gz

Using PurePosixPath, which is not operating-system-dependent and handles URLs gracefully, is the Pythonic solution:

>>> from pathlib import PurePosixPath
>>> path = PurePosixPath('http://example.com/somefile.zip')
>>> path.name
'somefile.zip'
>>> path = PurePosixPath('http://example.com/nested/somefile.zip')
>>> path.name
'somefile.zip'

Note that there is no network traffic here or anything (i.e. those URLs don't go anywhere); it just uses standard parsing rules.
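One caveat worth illustrating: PurePosixPath treats the whole string as a path, so a query string ends up inside the name. A small sketch, assuming you are happy to run the URL through urllib.parse.urlsplit first (the example URL is made up):

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

url = 'http://example.com/nested/somefile.zip?foo=bar'

# Passing the raw URL keeps the query string in the last component.
raw_name = PurePosixPath(url).name

# Passing only the path component yields the clean file name, and
# .stem / .suffix then work as expected.
clean = PurePosixPath(urlsplit(url).path)

print(raw_name)      # somefile.zip?foo=bar
print(clean.name)    # somefile.zip
print(clean.stem)    # somefile
print(clean.suffix)  # .zip
```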

import os,urllib2
resp = urllib2.urlopen('http://www.example.com/index.html')
my_url = resp.geturl()

os.path.split(my_url)[1]

# 'index.html'

This isn't openfile, but maybe it still helps :)

License: CC-BY-SA with attribution

Not affiliated with StackOverflow