How do I download a file over HTTP using Python?

https://stackoverflow.com/questions/22676

09-06-2019
Question

I have a small utility that I use to download an MP3 from a website on a schedule; it then builds/updates a podcast XML file, which I have added to iTunes.

The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3. I would prefer to have the entire utility written in Python.

I struggled to find a way to actually download the file in Python, which is why I resorted to wget.

So, how do I download the file using Python?

Solution

In Python 2, use urllib2, which comes with the standard library.

import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()

This is the most basic way to use the library, minus any error handling. You can also do more complex things, such as changing headers. See the urllib2 documentation for details.
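The basic urllib2 call leaves out error handling and custom headers. A minimal Python 3 sketch of both follows, using urllib.request (the Python 3 successor to urllib2). The function name fetch, the User-Agent string, and the timeout value are illustrative assumptions, not part of the original answer.

```python
import urllib.error
import urllib.request


def fetch(url, user_agent="example-downloader/0.1", timeout=30):
    """Return the response body as bytes, or None if the request fails."""
    # Hypothetical header value, shown only to illustrate setting a header.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 403.
        # HTTPError must be caught first because it subclasses URLError.
        print("HTTP error {0}: {1}".format(error.code, error.reason))
    except urllib.error.URLError as error:
        # The resource could not be reached (DNS failure, refused connection...).
        print("Connection problem: {0}".format(error.reason))
    return None
```

Something like html = fetch('http://www.example.com/') would then return the page bytes, or None after printing the reason for the failure.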

Other tips

One more, using urlretrieve:

import urllib
urllib.urlretrieve ("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

(for Python 3+, use import urllib.request and urllib.request.urlretrieve)

Yet another one, with a "progress bar":

import urllib2

url = "http://download.thinkbroadband.com/10MB.zip"

file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break

    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,

f.close()

In 2012, use the Python requests library:

>>> import requests
>>> 
>>> url = "http://download.thinkbroadband.com/10MB.zip"
>>> r = requests.get(url)
>>> print len(r.content)
10485760

You can run pip install requests to get it.

Requests has many advantages over the alternatives because its API is much simpler. This is especially true if you have to do authentication; urllib and urllib2 are quite unintuitive and painful in that case.
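As a rough illustration of the authentication point, here is a minimal sketch of an HTTP Basic auth download with requests. The function name download_with_basic_auth, the chunk size, and the timeout are assumptions made for this example; the auth, stream, and timeout arguments are standard requests.get parameters.

```python
import requests


def download_with_basic_auth(url, dest_path, username, password, timeout=30):
    """Stream a password-protected file to dest_path using HTTP Basic auth."""
    response = requests.get(
        url,
        auth=(username, password),  # requests builds the Authorization header
        stream=True,                # do not load the whole body into memory
        timeout=timeout,
    )
    try:
        response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
        with open(dest_path, "wb") as handle:
            for chunk in response.iter_content(chunk_size=64 * 1024):
                handle.write(chunk)
    finally:
        response.close()  # release the connection even if writing fails
    return dest_path
```

The credentials are passed as a plain tuple, which is the part that takes noticeably more code with urllib2 handlers.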

2015-12-30

People have expressed admiration for the progress bar. It's cool, sure. There are now several off-the-shelf solutions, including tqdm:

from tqdm import tqdm
import requests

url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)

with open("10MB", "wb") as handle:
    for data in tqdm(response.iter_content()):
        handle.write(data)

This is essentially the implementation @kvance described 30 months ago.

import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3','wb') as output:
  output.write(mp3file.read())

The wb in open('test.mp3','wb') opens the file in binary mode (and erases any existing content), so you can save raw data with it instead of just text.
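A small sketch of what the two letters do, written for Python 3, where downloaded data arrives as bytes. The helper name save_bytes and the dummy payloads are assumptions for illustration only; no network access is involved.

```python
import os
import tempfile


def save_bytes(path, data):
    """Write raw bytes to path, replacing any previous content."""
    with open(path, "wb") as output:  # 'w' truncates, 'b' means binary
        output.write(data)
    return os.path.getsize(path)


demo_path = os.path.join(tempfile.mkdtemp(), "test.mp3")
first_size = save_bytes(demo_path, b"\xff\xfb" + b"\x00" * 98)  # 100 bytes
second_size = save_bytes(demo_path, b"\xff\xfb")  # old content is erased

try:
    with open(demo_path, "w") as text_file:  # text mode, no 'b'
        text_file.write(b"\xff\xfb")
    text_mode_error = None
except TypeError as error:
    text_mode_error = error  # Python 3 refuses to write bytes as text
```

After the second call the file holds only the two new bytes, and the text-mode attempt fails with a TypeError instead of silently corrupting the data.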

Python 3

urllib.request.urlopen

import urllib.request
response = urllib.request.urlopen('http://www.example.com/')
html = response.read()

urllib.request.urlretrieve

import urllib.request
urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')

Python 2

urllib2.urlopen (thanks Corey)

import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()

urllib.urlretrieve (thanks PabloG)

import urllib
urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')

Use the wget module:

import wget
wget.download('url')

An improved version of PabloG's code, for Python 2/3:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import ( division, absolute_import, print_function, unicode_literals )

import sys, os, tempfile, logging

if sys.version_info >= (3,):
    import urllib.request as urllib2
    import urllib.parse as urlparse
else:
    import urllib2
    import urlparse

def download_file(url, dest=None):
    """ 
    Download and save a file specified by url to dest directory,
    """
    u = urllib2.urlopen(url)

    scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
    filename = os.path.basename(path)
    if not filename:
        filename = 'downloaded.file'
    if dest:
        filename = os.path.join(dest, filename)

    with open(filename, 'wb') as f:
        meta = u.info()
        meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
        meta_length = meta_func("Content-Length")
        file_size = None
        if meta_length:
            file_size = int(meta_length[0])
        print("Downloading: {0} Bytes: {1}".format(url, file_size))

        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break

            file_size_dl += len(buffer)
            f.write(buffer)

            status = "{0:16}".format(file_size_dl)
            if file_size:
                status += "   [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
            status += chr(13)
            print(status, end="")
        print()

    return filename

if __name__ == "__main__":  # Only run if this file is called directly
    print("Testing with 10MB download")
    url = "http://download.thinkbroadband.com/10MB.zip"
    filename = download_file(url)
    print(filename)

I wrote the wget library in pure Python just for this purpose. As of version 2.0, it is a pumped-up urlretrieve with these features.

A simple way that is compatible with both Python 2 and Python 3, using the six library:

from six.moves import urllib
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

I agree with Corey: urllib2 is more complete than urllib and should probably be the module you use if you want to do more complex things. To make the answers more complete, though, urllib is a simpler module if you only want the basics:

import urllib
response = urllib.urlopen('http://www.example.com/sound.mp3')
mp3 = response.read()

This will work fine. Or, if you don't want to deal with the "response" object, you can call read() directly:

import urllib
mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()

The following are the most commonly used calls for downloading files in Python:

urllib.urlretrieve ('url_to_file', file_name)
urllib2.urlopen('url_to_file')
requests.get(url)
wget.download('url', file_name)

Note: urlopen and urlretrieve have been found to perform relatively poorly when downloading large files (size > 500 MB). A plain requests.get keeps the whole file in memory until the download is complete; the function below streams it to disk in chunks instead.

import os,requests
def download(url):
    get_response = requests.get(url,stream=True)
    file_name  = url.split("/")[-1]
    with open(file_name, 'wb') as f:
        for chunk in get_response.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)


download("https://example.com/example.jpg")
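Taking the last path segment with url.split("/")[-1] keeps any query string and returns an empty name for URLs that end in a slash. A hedged Python 3 sketch of a more defensive alternative follows; the function name filename_from_url and the fallback name are assumptions made for this example.

```python
import os
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="downloaded.file"):
    """Derive a local file name from the path component of a URL."""
    path = urlsplit(url).path  # drops '?query' and '#fragment'
    name = os.path.basename(unquote(path))  # decode %20 and similar escapes
    return name or default  # fall back when the path ends in '/'
```

The result could be passed to open(..., 'wb') in place of the split-based name in the download function above.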

You can get progress feedback with urlretrieve as well:

import sys
import urllib

def report(blocknr, blocksize, size):
    # urlretrieve calls this hook with the block count, block size and total size
    current = blocknr*blocksize
    sys.stdout.write("\r{0:.2f}%".format(100.0*current/size))

def downloadFile(url):
    print "\n",url
    fname = url.split('/')[-1]
    print fname
    urllib.urlretrieve(url, fname, report)
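The snippet above is Python 2. A possible Python 3 counterpart is sketched below; it also caps the percentage at 100 (the final block is usually partial) and copes with a missing Content-Length, in which case urlretrieve reports a total size of -1. The names percent_done and download_file are assumptions for illustration.

```python
import sys
import urllib.request


def percent_done(block_number, block_size, total_size):
    """Return progress as a percentage, or None when the size is unknown."""
    if total_size <= 0:  # urlretrieve passes -1 when Content-Length is absent
        return None
    # The last block is usually partial, so cap the estimate at 100.
    return min(100.0, 100.0 * block_number * block_size / total_size)


def report(block_number, block_size, total_size):
    """Report hook for urlretrieve: rewrite one status line in place."""
    percent = percent_done(block_number, block_size, total_size)
    if percent is None:
        sys.stdout.write("\r{0} bytes".format(block_number * block_size))
    else:
        sys.stdout.write("\r{0:.2f}%".format(percent))
    sys.stdout.flush()


def download_file(url, file_name):
    """Download url to file_name, printing progress as it goes."""
    path, _headers = urllib.request.urlretrieve(url, file_name, report)
    sys.stdout.write("\n")
    return path
```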

If you have wget installed, you can use parallel_sync.

pip install parallel_sync

from parallel_sync import wget
urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
wget.download('/tmp', urls)
# or a single file:
wget.download('/tmp', urls[0], filenames='x.zip', extract=True)

Doc: https://pythonhosted.org/parallel_sync/pages/examples.html

This is pretty powerful. It can download files in parallel, retry upon failure, and even download files on a remote machine.
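For comparison, the same two ideas (parallel downloads and retry on failure) can be roughed out with the Python 3 standard library alone. This sketch does not use parallel_sync or wget; it swaps in concurrent.futures and urllib.request, and the names fetch_one and fetch_many, the attempt count, and the worker count are assumptions.

```python
import concurrent.futures
import os
import shutil
import urllib.parse
import urllib.request


def fetch_one(url, dest_dir, attempts=3):
    """Download one URL into dest_dir, retrying on I/O or network errors."""
    path = urllib.parse.urlsplit(url).path
    target = os.path.join(dest_dir, os.path.basename(path) or "downloaded.file")
    last_error = None
    for _ in range(max(1, attempts)):
        try:
            with urllib.request.urlopen(url, timeout=30) as response, \
                    open(target, "wb") as out_file:
                shutil.copyfileobj(response, out_file)
            return target
        except OSError as error:  # urllib.error.URLError subclasses OSError
            last_error = error
    raise last_error  # every attempt failed


def fetch_many(urls, dest_dir, workers=4):
    """Download several URLs concurrently; results keep the input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda url: fetch_one(url, dest_dir), urls))
```

Threads are a reasonable fit here because the work is dominated by waiting on the network rather than by computation.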

In Python 3 you can use the urllib.request and shutil modules. Both are part of the standard library, so there is nothing to install with pip.

Run this code:

import urllib.request
import shutil

url = "http://www.somewebsite.com/something.pdf"
output_file = "save_this_name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

Note that the code uses the standard-library urllib package, not the third-party urllib3.

If speed matters to you, I ran a small performance test on the urllib and wget modules; for wget I tried once with the status bar and once without. I used three different 500 MB files (different files, to rule out any caching going on under the hood). Tested on a Debian machine with Python 2.

First, these are the results (they are similar across runs):

$ python wget_test.py 
urlretrive_test : starting
urlretrive_test : 6.56
==============
wget_no_bar_test : starting
wget_no_bar_test : 7.20
==============
wget_with_bar_test : starting
100% [......................................................................] 541335552 / 541335552
wget_with_bar_test : 50.49
==============

I ran the test using a "profile" decorator. This is the full code:

import wget
import urllib
import time
from functools import wraps

def profile(func):
    @wraps(func)
    def inner(*args):
        print func.__name__, ": starting"
        start = time.time()
        ret = func(*args)
        end = time.time()
        print func.__name__, ": {:.2f}".format(end - start)
        return ret
    return inner

url1 = 'http://host.com/500a.iso'
url2 = 'http://host.com/500b.iso'
url3 = 'http://host.com/500c.iso'

def do_nothing(*args):
    pass

@profile
def urlretrive_test(url):
    return urllib.urlretrieve(url)

@profile
def wget_no_bar_test(url):
    return wget.download(url, out='/tmp/', bar=do_nothing)

@profile
def wget_with_bar_test(url):
    return wget.download(url, out='/tmp/')

urlretrive_test(url1)
print '=============='
time.sleep(1)

wget_no_bar_test(url2)
print '=============='
time.sleep(1)

wget_with_bar_test(url3)
print '=============='
time.sleep(1)

urllib seems to be the fastest.
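The benchmark above is Python 2. If you want to repeat the measurement on Python 3, the timing decorator could be sketched roughly as follows; the use of time.perf_counter and the stand-in function sleep_briefly are assumptions of this example, not part of the original test.

```python
import time
from functools import wraps


def profile(func):
    """Print how long each call to func takes and return its result."""
    @wraps(func)
    def inner(*args, **kwargs):
        print(func.__name__, ": starting")
        start = time.perf_counter()  # monotonic, high-resolution clock
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed downloads are timed too.
            elapsed = time.perf_counter() - start
            print("{0} : {1:.2f}".format(func.__name__, elapsed))
    return inner


@profile
def sleep_briefly(seconds):
    # Stand-in for a download call such as urllib.request.urlretrieve(url).
    time.sleep(seconds)
    return seconds
```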

Just for the sake of completeness, it is also possible to call any file-retrieval program using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-r), can deal with FTP, redirects, and HTTP proxies, and can avoid re-downloading existing files (-nc); aria2 can do multi-connection downloads, which can potentially speed up your downloads.

import subprocess
subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])
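A slightly more defensive sketch of the same idea for Python 3: it builds the argument list separately, checks that wget is available, and adds the -nc flag mentioned above. The function names build_wget_command and run_wget are assumptions; -P (target directory) and -nc (skip existing files) are standard wget options.

```python
import shutil
import subprocess


def build_wget_command(url, dest_dir, skip_existing=True):
    """Build the argument list for a wget call (nothing is executed here)."""
    command = ["wget", "-P", dest_dir]  # -P: directory to save files into
    if skip_existing:
        command.append("-nc")  # -nc: do not re-download existing files
    command.append(url)
    return command


def run_wget(url, dest_dir, skip_existing=True):
    """Run wget and raise if it is missing or exits with an error."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed or not on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_wget_command(url, dest_dir, skip_existing), check=True)
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when a URL contains characters such as & or ?.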

In Jupyter Notebook, you can also call programs directly with the ! syntax:

!wget -O example_output_file.html https://example.com

The source code can be:

import urllib
sock = urllib.urlopen("http://diveintopython.org/")
htmlSource = sock.read()                            
sock.close()                                        
print htmlSource

I wrote the following, which works in vanilla Python 2 or Python 3.

import sys
try:
    import urllib.request
    python3 = True
except ImportError:
    import urllib2
    python3 = False


def progress_callback_simple(downloaded,total):
    sys.stdout.write(
        "\r" +
        (len(str(total))-len(str(downloaded)))*" " + str(downloaded) + "/%d"%total +
        " [%3.2f%%]"%(100.0*float(downloaded)/float(total))
    )
    sys.stdout.flush()

def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
    def _download_helper(response, out_file, file_size):
        if progress_callback!=None: progress_callback(0,file_size)
        if block_size == None:
            buffer = response.read()
            out_file.write(buffer)

            if progress_callback!=None: progress_callback(file_size,file_size)
        else:
            file_size_dl = 0
            while True:
                buffer = response.read(block_size)
                if not buffer: break

                file_size_dl += len(buffer)
                out_file.write(buffer)

                if progress_callback!=None: progress_callback(file_size_dl,file_size)
    with open(dstfilepath,"wb") as out_file:
        if python3:
            with urllib.request.urlopen(srcurl) as response:
                file_size = int(response.getheader("Content-Length"))
                _download_helper(response,out_file,file_size)
        else:
            response = urllib2.urlopen(srcurl)
            meta = response.info()
            file_size = int(meta.getheaders("Content-Length")[0])
            _download_helper(response,out_file,file_size)

import traceback
try:
    download(
        "https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
        "output.zip",
        progress_callback_simple
    )
except:
    traceback.print_exc()
    input()

Notes:

Supports a "progress bar" callback.
The download is a 4 MB test .zip from my website.

You can use PycURL on Python 2 and 3.

import pycurl

FILE_DEST = 'pycurl.html'
FILE_SRC = 'http://pycurl.io/'

with open(FILE_DEST, 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, FILE_SRC)
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()

This may be a little late, but I saw PabloG's code and couldn't help adding an os.system('cls') to make it look AWESOME! Check it out:

import urllib2,os

url = "http://download.thinkbroadband.com/10MB.zip"

file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
os.system('cls')
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break

    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,

f.close()

If you are running in an environment other than Windows, you will have to use something other than 'cls'. On Mac OS X and Linux it should be 'clear'.
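One way to avoid hard-coding the command is to pick it from os.name, which is 'nt' on Windows. A tiny sketch, with clear_command and clear_screen as hypothetical helper names:

```python
import os


def clear_command():
    """Return the console-clearing command for the current platform."""
    return "cls" if os.name == "nt" else "clear"


def clear_screen():
    """Clear the terminal using the platform's own command."""
    os.system(clear_command())
```

Calling clear_screen() in place of os.system('cls') would let the same script run on Windows, macOS, and Linux.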

urlretrieve and requests.get are simple, but reality is not. I have fetched data from a couple of sites, including text and images, and those two calls probably solve most tasks. For a more universal solution, though, I suggest using urlopen. Because it is included in the Python 3 standard library, your code can run on any machine with Python 3 without pre-installing any site-packages.

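import urllib.request

# Placeholder values for illustration only; replace them with your own.
url = 'http://www.example.com/songs/mp3.mp3'
filename = 'mp3.mp3'
buffer_size = 8192
# Assumption: a browser-like User-Agent, a common workaround for HTTP 403.
headers = {'User-Agent': 'Mozilla/5.0'}

url_request = urllib.request.Request(url, headers=headers)
url_connect = urllib.request.urlopen(url_request)

#remember to open file in bytes mode
with open(filename, 'wb') as f:
    while True:
        buffer = url_connect.read(buffer_size)
        if not buffer: break

        #an integer value of size of written data
        data_wrote = f.write(buffer)

#you could probably use with-open-as manner
url_connect.close()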

This answer provides a solution to HTTP 403 Forbidden errors when downloading a file over HTTP with Python. I have only tried the requests and urllib modules; other modules may offer something better, but this is the one I used to solve most of my problems.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow