Caching in urllib2?

https://stackoverflow.com/questions/148853

02-07-2019

Question

Is there an easy way to cache things when using urllib2 that I am overlooking, or do I have to roll my own?

Solution

You could use a decorator such as:

class cache(object):
    def __init__(self, fun):
        self.fun = fun
        self.cache = {}

    def __call__(self, *args, **kwargs):
        key  = str(args) + str(kwargs)
        try:
            return self.cache[key]
        except KeyError:
            self.cache[key] = rval = self.fun(*args, **kwargs)
            return rval
        except TypeError: # in case key isn't a valid key - don't cache
            return self.fun(*args, **kwargs)

and define a function along the lines of:

import urllib

@cache
def get_url_src(url):
    return urllib.urlopen(url).read()

This assumes you are not paying attention to HTTP cache controls and just want to cache the page for the lifetime of the application.
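On Python 3, where urllib2 became urllib.request, the standard library's functools.lru_cache gives the same process-lifetime memoization. Here is a minimal sketch. The fetch() function is a hypothetical stand-in so the example runs without network access; in real use its body would be urllib.request.urlopen(url).read().

```python
import functools

fetch_calls = []  # records every simulated network hit


def fetch(url):
    """Hypothetical stand-in for urllib.request.urlopen(url).read()."""
    fetch_calls.append(url)
    return ("<html>body of %s</html>" % url).encode("utf-8")


@functools.lru_cache(maxsize=128)  # bounded, unlike the plain dict above
def get_url_src(url):
    return fetch(url)


first = get_url_src("http://example.com/")
second = get_url_src("http://example.com/")  # served from the cache
info = get_url_src.cache_info()
```

Like the decorator above, this ignores HTTP cache headers entirely; entries only disappear when the process exits or maxsize is exceeded.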

Other tips

If you don't mind working at a slightly lower level, httplib2 (https://github.com/httplib2/httplib2) is an excellent HTTP library that includes caching functionality.

This ActiveState Python recipe might be helpful: http://code.activestate.com/recipes/491261/

I've always been torn between using httplib2, which does a solid job of handling HTTP caching and authentication, and urllib2, which is in the stdlib, has an extensible interface, and supports HTTP proxy servers.

The ActiveState recipe starts to add caching support to urllib2, but only in a very primitive fashion. It does not allow for extensibility in storage mechanisms, hard-coding the file-system-backed storage. It also does not honor HTTP cache headers.
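To make "honoring HTTP cache headers" concrete, here is a deliberately simplified sketch of a freshness check based only on the Cache-Control max-age directive. The function names are hypothetical, the headers are assumed to be a plain dict with the exact key "Cache-Control", and a real implementation would also have to consider Expires, Age, validators such as ETag, and more.

```python
import re


def max_age_from_headers(headers):
    """Return max-age in seconds, or None if absent or caching is disallowed."""
    cache_control = headers.get("Cache-Control", "")
    directives = [d.strip().lower() for d in cache_control.split(",") if d.strip()]
    if "no-store" in directives or "no-cache" in directives:
        return None
    for directive in directives:
        match = re.match(r"max-age=(\d+)$", directive)
        if match:
            return int(match.group(1))
    return None


def is_fresh(headers, stored_at, now):
    """True if a response stored at stored_at may still be reused at now."""
    max_age = max_age_from_headers(headers)
    return max_age is not None and (now - stored_at) < max_age


headers = {"Cache-Control": "public, max-age=60"}
fresh_after_30s = is_fresh(headers, stored_at=1000, now=1030)
fresh_after_100s = is_fresh(headers, stored_at=1000, now=1100)
```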

In an attempt to bring together the best features of httplib2 caching and urllib2 extensibility, I've adapted the ActiveState recipe to implement most of the same caching functionality found in httplib2. The module is in jaraco.net as jaraco.net.http.caching. The link points to the module as it exists at the time of this writing. While that module is currently part of the larger jaraco.net package, it has no intra-package dependencies, so feel free to pull the module out and use it in your own projects.

Alternatively, if you have Python 2.6 or later, you can easy_install jaraco.net>=1.3 and then use the CacheHandler with something like the code in caching.quick_test().



"""Quick test/example of CacheHandler"""
import logging
import urllib2
from httplib2 import FileCache
from jaraco.net.http.caching import CacheHandler

logging.basicConfig(level=logging.DEBUG)
store = FileCache(".cache")
opener = urllib2.build_opener(CacheHandler(store))
urllib2.install_opener(opener)
response = opener.open("http://www.google.com/")
print response.headers
print "Response:", response.read()[:100], '...\n'

response.reload(store)
print response.headers
print "After reload:", response.read()[:100], '...\n'


Note that jaraco.net.http.caching does not provide a specification for the backing store for the cache, but instead follows the interface used by httplib2. For this reason, httplib2.FileCache can be used directly with urllib2 and the CacheHandler. Other backing caches designed for httplib2 should also be usable by the CacheHandler.
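As an illustration of what such a backing store looks like: as far as I can tell from httplib2.FileCache, the interface is just three methods, get(key), set(key, value) and delete(key), with get returning None for a missing key. Under that assumption, an in-memory store could be sketched like this (MemoryCache is a hypothetical name, not part of either library):

```python
class MemoryCache(object):
    """Dict-backed store with the get/set/delete shape of httplib2.FileCache."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Missing keys yield None rather than raising.
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        # Deleting an unknown key is a silent no-op.
        self._data.pop(key, None)


store = MemoryCache()
store.set("http://example.com/", b"cached bytes")
hit = store.get("http://example.com/")
store.delete("http://example.com/")
miss = store.get("http://example.com/")
```

Whether a given handler accepts it depends on that handler relying on nothing beyond these three methods.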



	
		
	
	
I was looking for something similar and came across "Recipe 491261: Caching and throttling for urllib2", which danivo posted. The problem is that I really dislike the caching code (lots of duplication, lots of manually joining file paths instead of using os.path.join, uses staticmethods, not very PEP8'ish, and other things that I try to avoid).

The code below is a bit nicer (in my opinion, anyway) and is functionally the same, with a few additions, mainly the "recache" method. Example usage is at http://github.com/dbr/tvdb_api/blob/master/cache.py, and I'll paste the code here for posterity (with my application-specific headers removed):

#!/usr/bin/env python
"""
urllib2 caching handler
Modified from http://code.activestate.com/recipes/491261/ by dbr
"""

import os
import time
import httplib
import urllib2
import StringIO
from hashlib import md5

def calculate_cache_path(cache_location, url):
    """Returns the paths [cache_location]/[hash_of_url].headers and .body
    """
    thumb = md5(url).hexdigest()
    header = os.path.join(cache_location, thumb + ".headers")
    body = os.path.join(cache_location, thumb + ".body")
    return header, body

def check_cache_time(path, max_age):
    """Checks if a file has been created/modified in the [last max_age] seconds.
    False means the file is too old (or doesn't exist), True means it is
    up-to-date and valid"""
    if not os.path.isfile(path):
        return False
    cache_modified_time = os.stat(path).st_mtime
    time_now = time.time()
    if cache_modified_time < time_now - max_age:
        # Cache is old
        return False
    else:
        return True

def exists_in_cache(cache_location, url, max_age):
    """Returns if header AND body cache file exist (and are up-to-date)"""
    hpath, bpath = calculate_cache_path(cache_location, url)
    if os.path.exists(hpath) and os.path.exists(bpath):
        return(
            check_cache_time(hpath, max_age)
            and check_cache_time(bpath, max_age)
        )
    else:
        # File does not exist
        return False

def store_in_cache(cache_location, url, response):
    """Tries to store response in cache.

    Returns False if both files were written, True if an IOError occurred.
    """
    hpath, bpath = calculate_cache_path(cache_location, url)
    try:
        outf = open(hpath, "w")
        headers = str(response.info())
        outf.write(headers)
        outf.close()

        outf = open(bpath, "wb")  # binary mode so non-text bodies survive intact
        outf.write(response.read())
        outf.close()
    except IOError:
        return True
    else:
        return False

class CacheHandler(urllib2.BaseHandler):
    """Stores responses in a persistent on-disk cache.

    If a subsequent GET request is made for the same URL, the stored
    response is returned, saving time, resources and bandwidth
    """
    def __init__(self, cache_location, max_age = 21600):
        """The location of the cache directory"""
        self.max_age = max_age
        self.cache_location = cache_location
        if not os.path.exists(self.cache_location):
            os.mkdir(self.cache_location)

    def default_open(self, request):
        """Handles GET requests, if the response is cached it returns it
        """
        if request.get_method() != "GET":  # compare by value, not identity
            return None # let the next handler try to handle the request

        if exists_in_cache(
            self.cache_location, request.get_full_url(), self.max_age
        ):
            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header = True
            )
        else:
            return None

    def http_response(self, request, response):
        """Gets a HTTP response, if it was a GET request and the status code
        starts with 2 (200 OK etc) it caches it and returns a CachedResponse
        """
        if (request.get_method() == "GET"
            and str(response.code).startswith("2")
        ):
            if 'x-local-cache' not in response.info():
                # Response is not cached
                set_cache_header = store_in_cache(
                    self.cache_location,
                    request.get_full_url(),
                    response
                )
            else:
                set_cache_header = True
            #end if x-cache in response

            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header = set_cache_header
            )
        else:
            return response

class CachedResponse(StringIO.StringIO):
    """An urllib2.response-like object for cached responses.

    To determine if a response is cached or coming directly from
    the network, check the x-local-cache header rather than the object type.
    """
    def __init__(self, cache_location, url, set_cache_header=True):
        self.cache_location = cache_location
        hpath, bpath = calculate_cache_path(cache_location, url)

        StringIO.StringIO.__init__(self, open(bpath, "rb").read())

        self.url     = url
        self.code    = 200
        self.msg     = "OK"
        headerbuf = open(hpath).read()
        if set_cache_header:
            headerbuf += "x-local-cache: %s\r\n" % (bpath)
        self.headers = httplib.HTTPMessage(StringIO.StringIO(headerbuf))

    def info(self):
        """Returns headers
        """
        return self.headers

    def geturl(self):
        """Returns original URL
        """
        return self.url

    def recache(self):
        new_request = urllib2.urlopen(self.url)
        set_cache_header = store_in_cache(
            self.cache_location,
            new_request.url,
            new_request
        )
        CachedResponse.__init__(self, self.cache_location, self.url, True)


if __name__ == "__main__":
    def main():
        """Quick test/example of CacheHandler"""
        opener = urllib2.build_opener(CacheHandler("/tmp/"))
        response = opener.open("http://google.com")
        print response.headers
        print "Response:", response.read()

        response.recache()
        print response.headers
        print "After recache:", response.read()
    main()
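For anyone on Python 3, the path and freshness helpers carry over with small changes. This is a sketch of just those two functions, not a port of the whole handler: md5 needs bytes, so the URL is encoded first, and the demo writes into a temporary directory and back-dates the file with os.utime to show the stale case.

```python
import os
import tempfile
import time
from hashlib import md5


def calculate_cache_path(cache_location, url):
    """Return the (.headers, .body) paths for url inside cache_location."""
    thumb = md5(url.encode("utf-8")).hexdigest()  # Python 3: hash bytes, not str
    return (os.path.join(cache_location, thumb + ".headers"),
            os.path.join(cache_location, thumb + ".body"))


def check_cache_time(path, max_age):
    """True if path exists and was modified within the last max_age seconds."""
    if not os.path.isfile(path):
        return False
    return os.stat(path).st_mtime >= time.time() - max_age


cache_dir = tempfile.mkdtemp()
hpath, bpath = calculate_cache_path(cache_dir, "http://example.com/")
with open(bpath, "wb") as outf:
    outf.write(b"cached body")

fresh = check_cache_time(bpath, max_age=60)    # file was just written
os.utime(bpath, (time.time() - 3600,) * 2)     # pretend it is an hour old
stale = check_cache_time(bpath, max_age=60)
missing = check_cache_time(hpath, max_age=60)  # headers file never written
```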
	


	
		
	
	
This article on the Yahoo Developer Network, http://developer.yahoo.com/python/python-caching.html, describes how to cache HTTP calls made through urllib to either memory or disk.
	


	
		
	
	
@dbr: you may also need to add caching of https responses with:

def https_response(self, request, response):
    return self.http_response(request,response)
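The reason this is needed is that the opener dispatches to handler methods by name, one per protocol, so a handler that only defines http_response never sees https traffic. A small Python 3 sketch (urllib.request is the successor of urllib2; RecordingHandler is a hypothetical name) shows the registration without making any network request:

```python
import urllib.request


class RecordingHandler(urllib.request.BaseHandler):
    """Records the URL of every response it is asked to process."""

    def __init__(self):
        self.seen = []

    def http_response(self, request, response):
        self.seen.append(request.full_url)
        return response

    # Reuse the same logic for https; without this line the handler
    # would only be registered for the http protocol.
    https_response = http_response


handler = RecordingHandler()
opener = urllib.request.build_opener(handler)

# Call the hook directly with a placeholder response object.
request = urllib.request.Request("https://example.com/")
placeholder = object()
returned = handler.https_response(request, placeholder)
```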



	
		
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow