Get the size of a file before downloading it in Python

https://stackoverflow.com/questions/5909

08-06-2019

Question

I'm downloading an entire directory from a web server. It works fine, but I can't figure out how to get the file size before downloading, so that I can check whether the file has been updated on the server. Can this be done the same way as when downloading a file from an FTP server?

import urllib
import re

url = "http://www.someurl.com"

# Download the page locally
f = urllib.urlopen(url)
html = f.read()
f.close()

f = open ("temp.htm", "w")
f.write (html)
f.close()

# List only the .TXT / .ZIP files
fnames = re.findall('^.*<a href="(\w+(?:\.txt|.zip)?)".*$', html, re.MULTILINE)

for fname in fnames:
    print fname, "..."

    f = urllib.urlopen(url + "/" + fname)

    #### Here I want to check the filesize to download or not #### 
    file = f.read()
    f.close()

    f = open (fname, "w")
    f.write (file)
    f.close()

@Jon: thank you for your quick answer. It works, but the file size reported by the web server is slightly smaller than the size of the downloaded file.

Examples:

Local Size  Server Size
 2.223.533  2.115.516
   664.603    662.121

Does it have anything to do with CR/LF conversion?

Solution

I have reproduced what you are seeing:

import urllib, os
link = "http://python.org"
print "opening url:", link
site = urllib.urlopen(link)
meta = site.info()
print "Content-Length:", meta.getheaders("Content-Length")[0]

f = open("out.txt", "r")
print "File on disk:",len(f.read())
f.close()


f = open("out.txt", "w")
f.write(site.read())
site.close()
f.close()

f = open("out.txt", "r")
print "File on disk after download:",len(f.read())
f.close()

print "os.stat().st_size returns:", os.stat("out.txt").st_size

This outputs:

opening url: http://python.org
Content-Length: 16535
File on disk: 16535
File on disk after download: 16535
os.stat().st_size returns: 16861

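What am I doing wrong here? Is os.stat().st_size not returning the correct size?

Edit: OK, I figured out what the problem was. The files were opened in text mode, so line endings were converted on write and the size on disk no longer matched the number of bytes received: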

import urllib, os
link = "http://python.org"
print "opening url:", link
site = urllib.urlopen(link)
meta = site.info()
print "Content-Length:", meta.getheaders("Content-Length")[0]

f = open("out.txt", "rb")
print "File on disk:",len(f.read())
f.close()


f = open("out.txt", "wb")
f.write(site.read())
site.close()
f.close()

f = open("out.txt", "rb")
print "File on disk after download:",len(f.read())
f.close()

print "os.stat().st_size returns:", os.stat("out.txt").st_size

This outputs:

$ python test.py
opening url: http://python.org
Content-Length: 16535
File on disk: 16535
File on disk after download: 16535
os.stat().st_size returns: 16535

Be sure that you open all the files in binary mode, for both reading and writing.

# open for binary write
open(filename, "wb")
# open for binary read
open(filename, "rb")
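
The size difference can be reproduced without any network access. This is only a sketch, assuming Python 3: it forces the CR/LF translation with newline="\r\n" so the effect shows on any platform (on Windows, plain text mode does this by default). The payload and file names are made up for the illustration.

```python
import os
import tempfile

# Five short lines with LF endings, standing in for bytes read from a server.
payload = b"line\n" * 5  # 25 bytes

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "text_mode.txt")
    binary_path = os.path.join(tmp, "binary_mode.txt")

    # Text mode with newline="\r\n" rewrites every "\n" as "\r\n" on write,
    # which mimics the default text-mode behaviour on Windows.
    with open(text_path, "w", newline="\r\n") as f:
        f.write(payload.decode("ascii"))

    # Binary mode writes the bytes exactly as they were received.
    with open(binary_path, "wb") as f:
        f.write(payload)

    text_size = os.stat(text_path).st_size
    binary_size = os.stat(binary_path).st_size

print("payload bytes:   ", len(payload))
print("text mode size:  ", text_size)
print("binary mode size:", binary_size)
```

The text-mode file grows by one byte per line, while the binary-mode file has the same size as the payload, which is the number a Content-Length header would report.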

Other answers

Using the info() method of the object returned by urllib.urlopen, you can get various information about the retrieved document. For example, grabbing the current Google logo:

>>> import urllib
>>> d = urllib.urlopen("http://www.google.co.uk/logos/olympics08_opening.gif")
>>> print d.info()

Content-Type: image/gif
Last-Modified: Thu, 07 Aug 2008 16:20:19 GMT  
Expires: Sun, 17 Jan 2038 19:14:07 GMT 
Cache-Control: public 
Date: Fri, 08 Aug 2008 13:40:41 GMT 
Server: gws 
Content-Length: 20172 
Connection: Close

It behaves like a dict, so to get the size of the file you use urllibobject.info()['Content-Length']

print f.info()['Content-Length']

To get the size of the local file (for comparison), you can use os.stat():

os.stat("/the/local/file.zip").st_size
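
Putting the two together, the comparison could look roughly like this. It is a minimal sketch in Python 3; needs_download is a hypothetical helper name, not part of urllib, and it takes the raw Content-Length header value as a string (or None if the server sent none).

```python
import os


def needs_download(content_length, local_path):
    """Return True if the local copy is missing or its size differs.

    content_length is the raw Content-Length header value (a string),
    or None when the server did not send the header.
    """
    if content_length is None:
        return True  # size unknown, so download to be on the safe side
    if not os.path.exists(local_path):
        return True  # nothing on disk yet
    # Compare the size on disk with the size announced by the server.
    return os.stat(local_path).st_size != int(content_length)
```

Keep in mind that an equal size does not prove the content is identical, and the comparison is only meaningful if the local file was written in binary mode.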

The size of the file is sent in the Content-Length header. Here is how to get it with urllib:

>>> site = urllib.urlopen("http://python.org")
>>> meta = site.info()
>>> print meta.getheaders("Content-Length")
['16535']
>>>

Also, if the server you are connecting to supports it, look at ETags and the If-Modified-Since and If-None-Match headers.

Using these takes advantage of the web server's caching rules: the server returns a 304 Not Modified status code if the content hasn't changed.
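
A conditional request could be sketched like this with Python 3's urllib.request. The helper names build_conditional_request and fetch_if_changed are hypothetical, and the sketch assumes you saved the ETag and/or Last-Modified values from an earlier response. urllib raises HTTPError for a 304 response, so that case is caught and turned into None.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a request that asks the server to reply only if the file changed."""
    request = urllib.request.Request(url)
    if etag:
        # ETag value saved from a previous response to the same URL.
        request.add_header("If-None-Match", etag)
    if last_modified:
        # Last-Modified value saved from a previous response.
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return the body as bytes, or None if the server answered 304."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # Not Modified: the local copy is still current
        raise
```

Whether this works depends on the server: if it ignores the conditional headers, it simply sends the full file with a 200 status.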

In Python 3:

>>> import urllib.request
>>> site = urllib.request.urlopen("http://python.org")
>>> print("FileSize: ", site.length)
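
If you only want the size, a HEAD request avoids opening a download at all. The following is a sketch using only the standard library; parse_content_length and remote_size are hypothetical helper names, and it assumes the server answers HEAD requests and includes a Content-Length header (not every server does).

```python
import urllib.request


def parse_content_length(headers):
    """Return Content-Length as an int, or None if absent or malformed."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def remote_size(url):
    """Ask the server for headers only and return the announced size."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return parse_content_length(response.headers)
```

For example, remote_size("http://python.org") returns the size in bytes as an int, or None when the server does not announce one.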

For Python 3 (tested on 3.5), this is the approach I'd recommend:

from urllib.request import urlopen

with urlopen(file_url) as in_file, open(local_file_address, 'wb') as out_file:
    print(in_file.getheader('Content-Length'))
    out_file.write(in_file.read())

A requests-based solution that uses HEAD instead of GET (it also prints the HTTP headers):

#!/usr/bin/python
# display size of a remote file without downloading

from __future__ import print_function
import sys
import requests

# number of bytes in a megabyte
MBFACTOR = float(1 << 20)

response = requests.head(sys.argv[1], allow_redirects=True)

print("\n".join([('{:<40}: {}'.format(k, v)) for k, v in response.headers.items()]))
size = response.headers.get('content-length', 0)
print('{:<40}: {:.2f} MB'.format('FILE SIZE', int(size) / MBFACTOR))

Usage

$ python filesize-remote-url.py https://httpbin.org/image/jpeg
...
Content-Length                          : 35588
FILE SIZE                               : 0.03 MB

Licensed under: CC-BY-SA with attribution