Caching in urllib2?

https://stackoverflow.com/questions/148853

02-07-2019

Question

Is there a simple way to cache content when using urllib2, or do I have to roll my own?

Solution

You could use a decorator such as:

class cache(object):
    def __init__(self, fun):
        self.fun = fun
        self.cache = {}

    def __call__(self, *args, **kwargs):
        key  = str(args) + str(kwargs)
        try:
            return self.cache[key]
        except KeyError:
            self.cache[key] = rval = self.fun(*args, **kwargs)
            return rval
        except TypeError:  # in case key isn't a valid key - don't cache
            return self.fun(*args, **kwargs)

and define a function along these lines:

import urllib

@cache
def get_url_src(url):
    return urllib.urlopen(url).read()

This assumes you are not concerned about HTTP cache controls and simply want to cache the page for the lifetime of the application.

Other tips

If you don't mind working at a slightly lower level, httplib2 (https://github.com/httplib2/httplib2) is an excellent HTTP library that includes caching functionality.

This ActiveState Python recipe might be helpful: http://code.activestate.com/recipes/491261/

I've always been torn between using httplib2, which handles HTTP caching and authentication very well, and urllib2, which is in the stdlib, has an extensible interface, and supports HTTP proxy servers.

The ActiveState recipe starts to add caching support to urllib2, but only in a very primitive fashion. It does not allow for extensibility in storage mechanisms, hard-coding the file-system-backed storage. It also does not honor HTTP cache headers.
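To illustrate what honoring HTTP cache headers could involve, here is a rough sketch that reads the max-age directive from a Cache-Control header value. The function name is hypothetical and the parsing is deliberately simplified (it ignores directives such as s-maxage and must-revalidate); it is not how the recipe or httplib2 actually does it:

```python
def max_age_from_cache_control(header_value):
    """Return the max-age in seconds, 0 if the response should not be
    reused without revalidation, or None if no lifetime is given.

    Hypothetical helper; a simplified reading of the Cache-Control header.
    """
    directives = [d.strip().lower() for d in header_value.split(",")]
    if "no-store" in directives or "no-cache" in directives:
        return 0
    for directive in directives:
        if directive.startswith("max-age="):
            try:
                return int(directive.split("=", 1)[1])
            except ValueError:
                return None  # malformed value: treat as unspecified
    return None


lifetime = max_age_from_cache_control("public, max-age=3600")
```

A cache that respects this value would compare it against the age of the stored entry instead of using a fixed, hard-coded expiry.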

In an attempt to bring together the best features of httplib2 caching and urllib2 extensibility, I've adapted the ActiveState recipe to implement most of the same caching functionality as is found in httplib2. The module is in jaraco.net as jaraco.net.http.caching. The link points to the module as it exists at the time of this writing. While that module is currently part of the larger jaraco.net package, it has no intra-package dependencies, so feel free to pull the module out and use it in your own projects.

If you are using Python 2.6 or later, you can easy_install jaraco.net>=1.3 and then use the CacheHandler with code similar to that in caching.quick_test().

"""Quick test/example of CacheHandler"""
import logging
import urllib2
from httplib2 import FileCache
from jaraco.net.http.caching import CacheHandler

logging.basicConfig(level=logging.DEBUG)
store = FileCache(".cache")
opener = urllib2.build_opener(CacheHandler(store))
urllib2.install_opener(opener)
response = opener.open("http://www.google.com/")
print response.headers
print "Response:", response.read()[:100], '...\n'

response.reload(store)
print response.headers
print "After reload:", response.read()[:100], '...\n'

Note that jaraco.net.http.caching does not provide a specification for the backing store for the cache, but instead follows the interface used by httplib2. For this reason, httplib2.FileCache can be used directly with urllib2 and the CacheHandler. Likewise, other backing caches designed for httplib2 should be usable by the CacheHandler.
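For reference, httplib2's FileCache exposes three methods: get(key), set(key, value) and delete(key). Assuming CacheHandler relies only on those three (an assumption worth checking against the module's source), an in-memory backing store could look roughly like this; MemoryCache is a hypothetical name:

```python
class MemoryCache(object):
    """Hypothetical in-memory store with the get/set/delete interface
    that httplib2.FileCache exposes."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Like FileCache.get, return None on a cache miss.
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        # Deleting a missing key is a no-op.
        self._data.pop(key, None)


store = MemoryCache()
store.set("http://example.com/", b"cached bytes")
```

Such a store loses its contents when the process exits, so it only suits short-lived scripts or tests.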

I was looking for something similar and came across "Recipe 491261: Caching and throttling for urllib2", which danivo posted. The problem is that I really dislike the caching code (lots of duplication, lots of manually joining file paths instead of using os.path.join, uses staticmethods, not very PEP8-ish, and other things I try to avoid).

The rewritten code is a bit nicer (in my opinion, anyway) and is functionally much the same, with a few additions, mainly the "recache" method (example usage can be found in the if __name__ == "__main__": section at the end of the code).

The latest version can be found at http://github.com/dbr/tvdb_api/blob/master/cache.py, and I'll paste it here for posterity (with my application-specific headers removed):

#!/usr/bin/env python
"""
urllib2 caching handler
Modified from http://code.activestate.com/recipes/491261/ by dbr
"""

import os
import time
import httplib
import urllib2
import StringIO
from hashlib import md5

def calculate_cache_path(cache_location, url):
    """Returns the [cache_location]/[hash_of_url].headers and .body paths
    """
    thumb = md5(url).hexdigest()
    header = os.path.join(cache_location, thumb + ".headers")
    body = os.path.join(cache_location, thumb + ".body")
    return header, body

def check_cache_time(path, max_age):
    """Checks if a file has been created/modified in the [last max_age] seconds.
    False means the file is too old (or doesn't exist), True means it is
    up-to-date and valid"""
    if not os.path.isfile(path):
        return False
    cache_modified_time = os.stat(path).st_mtime
    time_now = time.time()
    if cache_modified_time < time_now - max_age:
        # Cache is old
        return False
    else:
        return True

def exists_in_cache(cache_location, url, max_age):
    """Returns if header AND body cache file exist (and are up-to-date)"""
    hpath, bpath = calculate_cache_path(cache_location, url)
    if os.path.exists(hpath) and os.path.exists(bpath):
        return(
            check_cache_time(hpath, max_age)
            and check_cache_time(bpath, max_age)
        )
    else:
        # File does not exist
        return False

def store_in_cache(cache_location, url, response):
    """Tries to store response in cache."""
    hpath, bpath = calculate_cache_path(cache_location, url)
    try:
        outf = open(hpath, "w")
        headers = str(response.info())
        outf.write(headers)
        outf.close()

        outf = open(bpath, "w")
        outf.write(response.read())
        outf.close()
    except IOError:
        return True
    else:
        return False

class CacheHandler(urllib2.BaseHandler):
    """Stores responses in a persistent on-disk cache.

    If a subsequent GET request is made for the same URL, the stored
    response is returned, saving time, resources and bandwidth
    """
    def __init__(self, cache_location, max_age = 21600):
        """The location of the cache directory"""
        self.max_age = max_age
        self.cache_location = cache_location
        if not os.path.exists(self.cache_location):
            os.mkdir(self.cache_location)

    def default_open(self, request):
        """Handles GET requests, if the response is cached it returns it
        """
        if request.get_method() != "GET":
            return None # let the next handler try to handle the request

        if exists_in_cache(
            self.cache_location, request.get_full_url(), self.max_age
        ):
            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header = True
            )
        else:
            return None

    def http_response(self, request, response):
        """Gets a HTTP response, if it was a GET request and the status code
        starts with 2 (200 OK etc) it caches it and returns a CachedResponse
        """
        if (request.get_method() == "GET"
            and str(response.code).startswith("2")
        ):
            if 'x-local-cache' not in response.info():
                # Response is not cached
                set_cache_header = store_in_cache(
                    self.cache_location,
                    request.get_full_url(),
                    response
                )
            else:
                set_cache_header = True
            #end if x-cache in response

            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header = set_cache_header
            )
        else:
            return response

class CachedResponse(StringIO.StringIO):
    """An urllib2.response-like object for cached responses.

    To determine if a response is cached or coming directly from
    the network, check the x-local-cache header rather than the object type.
    """
    def __init__(self, cache_location, url, set_cache_header=True):
        self.cache_location = cache_location
        hpath, bpath = calculate_cache_path(cache_location, url)

        StringIO.StringIO.__init__(self, file(bpath).read())

        self.url     = url
        self.code    = 200
        self.msg     = "OK"
        headerbuf = file(hpath).read()
        if set_cache_header:
            headerbuf += "x-local-cache: %s\r\n" % (bpath)
        self.headers = httplib.HTTPMessage(StringIO.StringIO(headerbuf))

    def info(self):
        """Returns headers
        """
        return self.headers

    def geturl(self):
        """Returns original URL
        """
        return self.url

    def recache(self):
        new_request = urllib2.urlopen(self.url)
        set_cache_header = store_in_cache(
            self.cache_location,
            new_request.url,
            new_request
        )
        CachedResponse.__init__(self, self.cache_location, self.url, True)


if __name__ == "__main__":
    def main():
        """Quick test/example of CacheHandler"""
        opener = urllib2.build_opener(CacheHandler("/tmp/"))
        response = opener.open("http://google.com")
        print response.headers
        print "Response:", response.read()

        response.recache()
        print response.headers
        print "After recache:", response.read()
    main()
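The listing above is Python 2 code. As a hedged sketch of how its two core helpers might look on Python 3 (an assumption; this is not part of the original module), note that md5 needs bytes there, so the URL is encoded first. The is_fresh name is hypothetical and mirrors check_cache_time:

```python
import os
import tempfile
import time
from hashlib import md5


def calculate_cache_path(cache_location, url):
    """Return the header and body cache paths for a URL."""
    digest = md5(url.encode("utf-8")).hexdigest()  # md5 requires bytes
    return (
        os.path.join(cache_location, digest + ".headers"),
        os.path.join(cache_location, digest + ".body"),
    )


def is_fresh(path, max_age):
    """True if path exists and was modified within the last max_age seconds."""
    if not os.path.isfile(path):
        return False
    return os.stat(path).st_mtime >= time.time() - max_age


cache_dir = tempfile.mkdtemp()  # throwaway directory for the demonstration
hpath, bpath = calculate_cache_path(cache_dir, "http://example.com/")
with open(bpath, "wb") as body_file:
    body_file.write(b"cached body")
```

Only the body file is written here, so a check that requires both files, as exists_in_cache does, would still report a miss.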

This article on the Yahoo Developer Network - http://developer.yahoo.com/python/python-caching.html - describes how to cache HTTP calls made through urllib, either in memory or on disk.

@dbr: you may also need to add caching of https responses with:

def https_response(self, request, response):
    return self.http_response(request, response)
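The reason this works is that an opener dispatches response processing by method name: a handler method called https_response is picked up for HTTPS responses in the same way http_response is for HTTP. A minimal Python 3 sketch of the pattern, using urllib.request (the successor to urllib2) and a hypothetical TaggingHandler in place of the caching handler:

```python
import urllib.request


class TaggingHandler(urllib.request.BaseHandler):
    """Hypothetical handler that marks every response it processes."""

    def http_response(self, request, response):
        response.seen_by_handler = True
        return response

    # Reuse the same processing for HTTPS responses.
    https_response = http_response


# Building the opener registers the handler; no request is made here.
opener = urllib.request.build_opener(TaggingHandler())
```

Aliasing the method in the class body is equivalent to the explicit delegating method shown above; either form is fine.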

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow