Caching in urllib2?

https://stackoverflow.com/questions/148853

02-07-2019

Question

Is there a simple way to cache things when using urllib2 that I am overlooking, or do I have to roll my own?

Solution

You could use a decorator such as:

class cache(object):
    def __init__(self, fun):
        self.fun = fun
        self.cache = {}

    def __call__(self, *args, **kwargs):
        # Build a hashable key; unhashable arguments raise TypeError on lookup
        key = (args, tuple(sorted(kwargs.items())))
        try:
            return self.cache[key]
        except KeyError:
            self.cache[key] = rval = self.fun(*args, **kwargs)
            return rval
        except TypeError:  # in case key isn't a valid key - don't cache
            return self.fun(*args, **kwargs)

and define a function along the lines of:

import urllib2

@cache
def get_url_src(url):
    return urllib2.urlopen(url).read()

This assumes you are not paying attention to HTTP cache controls and just want to cache the page for the lifetime of the application.
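
If pages should expire after a while rather than live for the whole run, the same idea can be extended with a timestamp. The following is a minimal Python 3 sketch, not part of the original answer; ttl_cache, max_age and clock are hypothetical names chosen for illustration, and it still ignores HTTP cache headers.

```python
import functools
import time


def ttl_cache(max_age, clock=time.time):
    """Cache a function's results for at most max_age seconds (hypothetical helper)."""
    def decorator(fun):
        store = {}  # maps call arguments -> (timestamp, value)

        @functools.wraps(fun)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            now = clock()
            try:
                stamp, value = store[key]
                if now - stamp < max_age:
                    return value  # still fresh, reuse it
            except KeyError:
                pass  # never fetched before
            except TypeError:
                return fun(*args, **kwargs)  # unhashable arguments: do not cache
            value = fun(*args, **kwargs)
            store[key] = (now, value)
            return value

        return wrapper
    return decorator
```

Decorating a function such as get_url_src with @ttl_cache(max_age=300) would then refetch a given URL at most once every five minutes.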

Other suggestions

If you don't mind working at a slightly lower level, httplib2 (https://github.com/httplib2/httplib2) is an excellent HTTP library that includes caching functionality.

This ActiveState Python recipe might be helpful: http://code.activestate.com/recipes/491261/

I've always been torn between using httplib2, which does a solid job of handling HTTP caching and authentication, and urllib2, which is in the stdlib, has an extensible interface, and supports HTTP proxy servers.

The ActiveState recipe starts to add caching support to urllib2, but only in a very primitive fashion. It does not allow for extensibility in storage mechanisms, hard-coding the file-system-backed storage. It also does not honor HTTP cache headers.
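
To illustrate what honoring HTTP cache headers involves, here is a rough Python 3 sketch that derives a freshness lifetime from Cache-Control. It is a simplification under stated assumptions: headers is assumed to be a plain dict with lower-cased names, max_age_from_headers and is_fresh are hypothetical helpers, and the Expires header and validators such as ETag are not considered. It is not how httplib2 or the recipe implements this.

```python
import re


def max_age_from_headers(headers):
    """Return the max-age in seconds from a Cache-Control header, or None.

    headers is a dict of lower-cased header names to values (assumed shape).
    None means the response should not be served from the cache.
    """
    cache_control = headers.get("cache-control", "")
    directives = [d.strip().lower() for d in cache_control.split(",")]
    if "no-store" in directives or "no-cache" in directives:
        return None
    for directive in directives:
        match = re.fullmatch(r"max-age=(\d+)", directive)
        if match:
            return int(match.group(1))
    return None


def is_fresh(headers, stored_at, now):
    """True if a response stored at stored_at may still be reused at now."""
    max_age = max_age_from_headers(headers)
    return max_age is not None and (now - stored_at) < max_age
```

A caching handler could call is_fresh before returning a stored response and fall through to the network when it returns False.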

In an attempt to bring together the best features of httplib2 caching and urllib2 extensibility, I've adapted the ActiveState recipe to implement most of the same caching functionality found in httplib2. The module is in jaraco.net as jaraco.net.http.caching. The link points to the module as it existed at the time of this writing. Although that module is currently part of the larger jaraco.net package, it has no intra-package dependencies, so feel free to pull the module out and use it in your own projects.

Alternatively, if you have Python 2.6 or later, you can easy_install jaraco.net>=1.3 and then use the CacheHandler with something like the code in caching.quick_test().

"""Quick test/example of CacheHandler"""
import logging
import urllib2
from httplib2 import FileCache
from jaraco.net.http.caching import CacheHandler

logging.basicConfig(level=logging.DEBUG)
store = FileCache(".cache")
opener = urllib2.build_opener(CacheHandler(store))
urllib2.install_opener(opener)
response = opener.open("http://www.google.com/")
print response.headers
print "Response:", response.read()[:100], '...\n'

response.reload(store)
print response.headers
print "After reload:", response.read()[:100], '...\n'

Note that jaraco.net.http.caching does not provide a specification for the backing store for the cache, but instead follows the interface used by httplib2. For this reason, httplib2.FileCache can be used directly with urllib2 and the CacheHandler. Other backing caches designed for httplib2 should also be usable by the CacheHandler.
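
As an illustration of that interface, httplib2.FileCache exposes get, set and delete methods, so a custom store can be sketched as below. MemoryCache is a hypothetical name, and whether CacheHandler needs anything beyond these three methods is an assumption that should be checked against the module.

```python
class MemoryCache(object):
    """In-memory store with the get/set/delete methods httplib2.FileCache offers."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Return None on a miss, as httplib2.FileCache does
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        # Deleting a missing key is silently ignored
        self._data.pop(key, None)
```

An instance of such a class could then be passed wherever the example above passes FileCache(".cache"), trading persistence for speed.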

I was looking for something similar and came across "Recipe 491261: Caching and throttling for urllib2", posted by danivo. The problem is that I really dislike the caching code (lots of duplication, lots of manual joining of file paths instead of using os.path.join, use of staticmethods, not very PEP8-ish, and other things I try to avoid).

My code is a bit nicer (in my opinion, anyway) and functionally much the same, with a few additions, mainly the "recache" method (example usage can be seen in the if __name__ == "__main__": section at the end of the code).

The latest version can be found at http://github.com/dbr/tvdb_api/blob/master/cache.py, and I'll paste it here for posterity (with my application-specific headers removed):

#!/usr/bin/env python
"""
urllib2 caching handler
Modified from http://code.activestate.com/recipes/491261/ by dbr
"""

import os
import time
import httplib
import urllib2
import StringIO
from hashlib import md5

def calculate_cache_path(cache_location, url):
    """Checks if [cache_location]/[hash_of_url].headers and .body exist
    """
    thumb = md5(url).hexdigest()
    header = os.path.join(cache_location, thumb + ".headers")
    body = os.path.join(cache_location, thumb + ".body")
    return header, body

def check_cache_time(path, max_age):
    """Checks if a file has been created/modified in the [last max_age] seconds.
    False means the file is too old (or doesn't exist), True means it is
    up-to-date and valid"""
    if not os.path.isfile(path):
        return False
    cache_modified_time = os.stat(path).st_mtime
    time_now = time.time()
    if cache_modified_time < time_now - max_age:
        # Cache is old
        return False
    else:
        return True

def exists_in_cache(cache_location, url, max_age):
    """Returns if header AND body cache file exist (and are up-to-date)"""
    hpath, bpath = calculate_cache_path(cache_location, url)
    if os.path.exists(hpath) and os.path.exists(bpath):
        return(
            check_cache_time(hpath, max_age)
            and check_cache_time(bpath, max_age)
        )
    else:
        # File does not exist
        return False

def store_in_cache(cache_location, url, response):
    """Tries to store response in cache."""
    hpath, bpath = calculate_cache_path(cache_location, url)
    try:
        outf = open(hpath, "w")
        headers = str(response.info())
        outf.write(headers)
        outf.close()

        outf = open(bpath, "w")
        outf.write(response.read())
        outf.close()
    except IOError:
        return True
    else:
        return False

class CacheHandler(urllib2.BaseHandler):
    """Stores responses in a persistent on-disk cache.

    If a subsequent GET request is made for the same URL, the stored
    response is returned, saving time, resources and bandwidth
    """
    def __init__(self, cache_location, max_age = 21600):
        """The location of the cache directory"""
        self.max_age = max_age
        self.cache_location = cache_location
        if not os.path.exists(self.cache_location):
            os.mkdir(self.cache_location)

    def default_open(self, request):
        """Handles GET requests, if the response is cached it returns it
        """
        if request.get_method() != "GET":
            return None # let the next handler try to handle the request

        if exists_in_cache(
            self.cache_location, request.get_full_url(), self.max_age
        ):
            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header = True
            )
        else:
            return None

    def http_response(self, request, response):
        """Gets a HTTP response, if it was a GET request and the status code
        starts with 2 (200 OK etc) it caches it and returns a CachedResponse
        """
        if (request.get_method() == "GET"
            and str(response.code).startswith("2")
        ):
            if 'x-local-cache' not in response.info():
                # Response is not cached
                set_cache_header = store_in_cache(
                    self.cache_location,
                    request.get_full_url(),
                    response
                )
            else:
                set_cache_header = True
            #end if x-cache in response

            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header = set_cache_header
            )
        else:
            return response

class CachedResponse(StringIO.StringIO):
    """An urllib2.response-like object for cached responses.

    To determine if a response is cached or coming directly from
    the network, check the x-local-cache header rather than the object type.
    """
    def __init__(self, cache_location, url, set_cache_header=True):
        self.cache_location = cache_location
        hpath, bpath = calculate_cache_path(cache_location, url)

        StringIO.StringIO.__init__(self, file(bpath).read())

        self.url     = url
        self.code    = 200
        self.msg     = "OK"
        headerbuf = file(hpath).read()
        if set_cache_header:
            headerbuf += "x-local-cache: %s\r\n" % (bpath)
        self.headers = httplib.HTTPMessage(StringIO.StringIO(headerbuf))

    def info(self):
        """Returns headers
        """
        return self.headers

    def geturl(self):
        """Returns original URL
        """
        return self.url

    def recache(self):
        new_request = urllib2.urlopen(self.url)
        set_cache_header = store_in_cache(
            self.cache_location,
            new_request.url,
            new_request
        )
        CachedResponse.__init__(self, self.cache_location, self.url, True)


if __name__ == "__main__":
    def main():
        """Quick test/example of CacheHandler"""
        opener = urllib2.build_opener(CacheHandler("/tmp/"))
        response = opener.open("http://google.com")
        print response.headers
        print "Response:", response.read()

        response.recache()
        print response.headers
        print "After recache:", response.read()
    main()
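
On Python 3 the path and freshness helpers above need small changes, because hashlib only accepts bytes. A possible port, offered as a sketch rather than as dbr's code, could look like this; the now parameter is a hypothetical addition that makes the age check easier to test.

```python
import hashlib
import os
import time


def calculate_cache_path(cache_location, url):
    """Return the header and body cache file paths for a URL."""
    # hashlib needs bytes on Python 3, so the URL is encoded first
    thumb = hashlib.md5(url.encode("utf-8")).hexdigest()
    header = os.path.join(cache_location, thumb + ".headers")
    body = os.path.join(cache_location, thumb + ".body")
    return header, body


def check_cache_time(path, max_age, now=None):
    """True if path exists and was modified within the last max_age seconds."""
    if not os.path.isfile(path):
        return False
    if now is None:
        now = time.time()
    return os.stat(path).st_mtime >= now - max_age
```

The rest of the handler would also need porting (urllib.request instead of urllib2, io.BytesIO instead of StringIO), which this sketch does not attempt.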

This article on the Yahoo Developer Network, http://developer.yahoo.com/python/python-caching.html, describes how to cache HTTP calls made through urllib in memory or on disk.

@dbr: you may also need to cache https responses by adding:

def https_response(self, request, response):
    return self.http_response(request,response)
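
The reason this works is that urllib2 dispatches response processing to handler methods named <protocol>_response, so an https alias is enough. A minimal Python 3 sketch of the same idiom with urllib.request follows; CountingHandler and its seen list are hypothetical and exist only to make the behaviour observable.

```python
import urllib.request


class CountingHandler(urllib.request.BaseHandler):
    """Hypothetical handler that post-processes both http and https responses."""

    def __init__(self):
        self.seen = []  # URLs of the responses that passed through

    def http_response(self, request, response):
        self.seen.append(request.get_full_url())
        return response

    # Reuse the same method for https so those responses are handled too
    https_response = http_response
```

Assigning the method at class level is equivalent to the two-line wrapper above and keeps both protocols on one code path.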

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow