Is there a simple way to cache content when using urllib2, or do I need to roll my own?

Solution

You can use a decorator function, for example:

class cache(object):
    def __init__(self, fun):
        self.fun = fun
        self.cache = {}

    def __call__(self, *args, **kwargs):
        key  = str(args) + str(kwargs)
        try:
            return self.cache[key]
        except KeyError:
            self.cache[key] = rval = self.fun(*args, **kwargs)
            return rval
        except TypeError: # in case the key isn't valid - don't cache
            return self.fun(*args, **kwargs)

and define a function:

import urllib

@cache
def get_url_src(url):
    return urllib.urlopen(url).read()

This assumes you aren't paying attention to HTTP cache-control headers and just want to cache pages for the lifetime of the application.
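To see the decorator behave as a memoizer, it can be exercised with any ordinary function, not just a URL fetcher. In the sketch below, `square` and the `calls` list are illustrative additions (not part of the original answer) used to show that the wrapped function only runs once per distinct argument:

```python
class cache(object):
    """Memoizing decorator from the answer above: results are keyed on
    the stringified positional and keyword arguments."""
    def __init__(self, fun):
        self.fun = fun
        self.cache = {}

    def __call__(self, *args, **kwargs):
        key = str(args) + str(kwargs)
        try:
            return self.cache[key]
        except KeyError:
            self.cache[key] = rval = self.fun(*args, **kwargs)
            return rval

calls = []

@cache
def square(x):
    calls.append(x)  # record each real invocation of the wrapped function
    return x * x

print(square(4))   # first call: computed
print(square(4))   # second call: served from the cache
print(len(calls))  # the wrapped function only ran once
```

The same pattern applies unchanged to `get_url_src`: repeated requests for the same URL hit the dictionary instead of the network.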

Other tips

If you don't mind working at a lower level, httplib2 ( https://github.com/httplib2/httplib2 ) is an excellent HTTP library that includes caching functionality.

This ActiveState Python recipe may help: http://code.activestate.com/recipes/491261/

I always find myself torn between httplib2, which handles HTTP caching and authentication well, and urllib2, which lives in the stdlib, has an extensible interface, and supports HTTP proxy servers.

The ActiveState recipe starts to add caching support to urllib2, but only in a very primitive fashion. It allows no extensibility in the storage mechanism, hard-coding filesystem-backed storage, and it does not honor HTTP cache-control headers.

In an attempt to bring together the best of httplib2's caching and urllib2's extensibility, I adapted the ActiveState recipe to implement most of the same caching functionality found in httplib2. The module lives in jaraco.net as jaraco.net.http.caching. The link points to the module as it exists at the time of this writing. While that module is currently part of the larger jaraco.net package, it has no intra-package dependencies, so feel free to pull the module out and use it in your own projects.

Alternatively, if you have Python 2.6 or later, you can easy_install jaraco.net>=1.3 and then use the CacheHandler with something like the code in caching.quick_test():

"""Quick test/example of CacheHandler"""
import logging
import urllib2
from httplib2 import FileCache
from jaraco.net.http.caching import CacheHandler

logging.basicConfig(level=logging.DEBUG)
store = FileCache(".cache")
opener = urllib2.build_opener(CacheHandler(store))
urllib2.install_opener(opener)
response = opener.open("http://www.google.com/")
print response.headers
print "Response:", response.read()[:100], '...\n'

response.reload(store)
print response.headers
print "After reload:", response.read()[:100], '...\n'

Note that jaraco.net.http.caching does not mandate a particular backing store for the cache; instead, it follows the interface used by httplib2. As a result, httplib2.FileCache can be used directly with urllib2 and the CacheHandler, and other backing caches written for httplib2 should also be usable by the CacheHandler.
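The backing-store interface httplib2 expects is small: an object with get, set, and delete methods keyed by string. A minimal in-memory stand-in can be sketched as follows (`MemoryCache` is a hypothetical name, not part of httplib2 or jaraco.net):

```python
class MemoryCache(object):
    """In-memory cache following the httplib2 backing-store interface
    (get/set/delete), so it could be passed wherever a FileCache is
    accepted. Hypothetical sketch, not part of httplib2."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        # a cache miss returns None rather than raising
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value

    def delete(self, key):
        # deleting a missing key is a no-op
        self._store.pop(key, None)

store = MemoryCache()
store.set("http://www.google.com/", "cached response bytes")
print(store.get("http://www.google.com/"))  # cached response bytes
store.delete("http://www.google.com/")
print(store.get("http://www.google.com/"))  # None
```

Anything with this shape should be substitutable for FileCache in the quick_test example above.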

I was looking for something similar and came across "Recipe 491261: Caching and throttling for urllib2", which danivo posted. The problem is that I really dislike the caching code (lots of duplication, lots of manually joining file paths instead of using os.path.join, use of staticmethods, not very PEP 8-ish, and other things I try to avoid).

This code is a bit nicer (in my opinion, anyway) and functionally much the same, with a few additions, mainly a "recache" method (example usage can be seen in the if __name__ == "__main__": section at the end of the code).

The latest version can be found at http://github.com/dbr/tvdb_api/blob/master/cache.py, and I'll paste it here for posterity (with my application-specific headers removed):

#!/usr/bin/env python
"""
urllib2 caching handler
Modified from http://code.activestate.com/recipes/491261/ by dbr
"""

import os
import time
import httplib
import urllib2
import StringIO
from hashlib import md5

def calculate_cache_path(cache_location, url):
    """Checks if [cache_location]/[hash_of_url].headers and .body exist
    """
    thumb = md5(url).hexdigest()
    header = os.path.join(cache_location, thumb + ".headers")
    body = os.path.join(cache_location, thumb + ".body")
    return header, body

def check_cache_time(path, max_age):
    """Checks if a file has been created/modified in the [last max_age] seconds.
    False means the file is too old (or doesn't exist), True means it is
    up-to-date and valid"""
    if not os.path.isfile(path):
        return False
    cache_modified_time = os.stat(path).st_mtime
    time_now = time.time()
    if cache_modified_time < time_now - max_age:
        # Cache is old
        return False
    else:
        return True

def exists_in_cache(cache_location, url, max_age):
    """Returns if header AND body cache file exist (and are up-to-date)"""
    hpath, bpath = calculate_cache_path(cache_location, url)
    if os.path.exists(hpath) and os.path.exists(bpath):
        return(
            check_cache_time(hpath, max_age)
            and check_cache_time(bpath, max_age)
        )
    else:
        # File does not exist
        return False

def store_in_cache(cache_location, url, response):
    """Tries to store the response in the cache.
    Returns False on success, True on failure (IOError)."""
    hpath, bpath = calculate_cache_path(cache_location, url)
    try:
        outf = open(hpath, "w")
        headers = str(response.info())
        outf.write(headers)
        outf.close()

        outf = open(bpath, "w")
        outf.write(response.read())
        outf.close()
    except IOError:
        return True # failed to write the cache files
    else:
        return False # stored successfully

class CacheHandler(urllib2.BaseHandler):
    """Stores responses in a persistent on-disk cache.

    If a subsequent GET request is made for the same URL, the stored
    response is returned, saving time, resources and bandwidth
    """
    def __init__(self, cache_location, max_age = 21600):
        """The location of the cache directory"""
        self.max_age = max_age
        self.cache_location = cache_location
        if not os.path.exists(self.cache_location):
            os.mkdir(self.cache_location)

    def default_open(self, request):
        """Handles GET requests, if the response is cached it returns it
        """
        if request.get_method() != "GET":
            return None # let the next handler try to handle the request

        if exists_in_cache(
            self.cache_location, request.get_full_url(), self.max_age
        ):
            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header = True
            )
        else:
            return None

    def http_response(self, request, response):
        """Gets a HTTP response, if it was a GET request and the status code
        starts with 2 (200 OK etc) it caches it and returns a CachedResponse
        """
        if (request.get_method() == "GET"
            and str(response.code).startswith("2")
        ):
            if 'x-local-cache' not in response.info():
                # Response came from the network and is not yet cached
                failed = store_in_cache(
                    self.cache_location,
                    request.get_full_url(),
                    response
                )
                if failed:
                    # Could not write the cache files; pass the
                    # original response through untouched
                    return response
                set_cache_header = False
            else:
                # Response was already served from the cache
                set_cache_header = True

            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header = set_cache_header
            )
        else:
            return response

class CachedResponse(StringIO.StringIO):
    """An urllib2.response-like object for cached responses.

    To determine if a response is cached or coming directly from
    the network, check the x-local-cache header rather than the object type.
    """
    def __init__(self, cache_location, url, set_cache_header=True):
        self.cache_location = cache_location
        hpath, bpath = calculate_cache_path(cache_location, url)

        StringIO.StringIO.__init__(self, file(bpath).read())

        self.url     = url
        self.code    = 200
        self.msg     = "OK"
        headerbuf = file(hpath).read()
        if set_cache_header:
            headerbuf += "x-local-cache: %s\r\n" % (bpath)
        self.headers = httplib.HTTPMessage(StringIO.StringIO(headerbuf))

    def info(self):
        """Returns headers
        """
        return self.headers

    def geturl(self):
        """Returns original URL
        """
        return self.url

    def recache(self):
        new_request = urllib2.urlopen(self.url)
        set_cache_header = store_in_cache(
            self.cache_location,
            new_request.url,
            new_request
        )
        CachedResponse.__init__(self, self.cache_location, self.url, True)


if __name__ == "__main__":
    def main():
        """Quick test/example of CacheHandler"""
        opener = urllib2.build_opener(CacheHandler("/tmp/"))
        response = opener.open("http://google.com")
        print response.headers
        print "Response:", response.read()

        response.recache()
        print response.headers
        print "After recache:", response.read()
    main()
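One detail worth highlighting in the code above is the on-disk layout: two files per URL, both named by the md5 hex digest of the URL. The same scheme can be sketched standalone (adapted slightly so it also runs on Python 3, where the URL must be encoded to bytes before hashing):

```python
import os
from hashlib import md5

def calculate_cache_path(cache_location, url):
    """Two cache files per URL: <md5(url)>.headers and <md5(url)>.body."""
    thumb = md5(url.encode("utf-8")).hexdigest()
    header = os.path.join(cache_location, thumb + ".headers")
    body = os.path.join(cache_location, thumb + ".body")
    return header, body

header, body = calculate_cache_path("/tmp/cache", "http://google.com")
print(header)  # /tmp/cache/<32 hex chars>.headers
print(body)    # /tmp/cache/<32 hex chars>.body
```

Hashing the URL keeps filenames filesystem-safe regardless of what characters the URL contains, at the cost of making cache entries non-human-readable.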

This article on the Yahoo Developer Network - http://developer.yahoo.com/python/python-caching.html - describes how to cache HTTP calls made through urllib to memory or disk.

@dbr: You may also want to add caching of HTTPS responses with:

def https_response(self, request, response):
    return self.http_response(request,response)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow