Question

I am using the following script to create some RSS snapshots (let's say).

The script runs on a backend, and I am seeing very heavy, ever-increasing memory consumption.

import logging
import time

import webapp2
from google.appengine.ext import ndb

# User, StatsSnapShot and get_rss are defined elsewhere in the app.


class StartHandler(webapp2.RequestHandler):

    @ndb.toplevel
    def get(self):
        user_keys = User.query().fetch(1000, keys_only=True)
        if not user_keys:
            return
        logging.info("Starting to process users")
        successful_count = 0
        start_time = time.time()
        for user_key in user_keys:
            try:
                statssnapshot = StatsSnapShot(
                    parent=user_key,
                    property=get_rss(user_key.id()),  # makes a urlfetch
                )
                statssnapshot.put_async()
                successful_count += 1
            except Exception:
                logging.exception("Failed to snapshot user %s", user_key.id())
        logging.info("Processed: [%d] users after [%d] secs",
                     successful_count, int(time.time() - start_time))

EDIT

Here is also the RSS function, let's say:

import logging
import re

from google.appengine.api import urlfetch


def get_rss(url):
    try:
        result = urlfetch.fetch(url)
        if result.status_code != 200:
            logging.warning("Invalid URLfetch")
            return
    except urlfetch.Error as e:
        logging.warning("Fetch failed to get %s with %s", url, e)
        return
    content = result.content  # around 200-500 KB per fetch
    reobj = re.compile(r'(?<=")[0-9]{21}(?=")')
    user_ids = reobj.findall(content)
    return set(user_ids)  # deduplicate; a set silently drops duplicates

The script runs OK, but as the number of users grows, it consumes more and more memory. Coming from C, I don't know how to manage memory and variables in Python efficiently.

For example, I know that when a variable in Python is no longer referenced, the garbage collector frees the memory it used. So what is happening in my case, and where am I going wrong?
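For illustration, that reference-counting behavior can be observed directly (a minimal sketch with a hypothetical Blob class; CPython frees most objects as soon as their last reference disappears, and the weakref here is just a way to observe it):

import gc
import weakref

class Blob(object):
    pass

blob = Blob()
observer = weakref.ref(blob)  # a weak reference does not keep blob alive
del blob                      # drop the last strong reference
gc.collect()                  # only needed for reference cycles in CPython
print(observer() is None)     # True: the object was freed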

How can I optimize this script so that memory usage does not keep growing, and it only consumes the memory required to process each user?


Solution

NDB adds automatic caching, which is usually very convenient. There is an in-context (in-memory) cache and memcache, and you can set policies for both.
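For example, if the snapshots are never read back in the same request, caching could be disabled for that kind via context policies (a sketch; the StatsSnapShot kind name is taken from the question):

from google.appengine.ext import ndb

# Disable both caches for the StatsSnapShot kind only, leaving caching
# enabled for everything else. A policy function receives a Key and
# returns whether that entity should be cached.
ctx = ndb.get_context()
ctx.set_cache_policy(lambda key: key.kind() != 'StatsSnapShot')
ctx.set_memcache_policy(lambda key: key.kind() != 'StatsSnapShot')

Equivalently, setting the class attribute _use_cache = False on the StatsSnapShot model disables the in-context cache for that kind.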

When making a put, you can provide context options, and I suspect that the following would work for you:

statssnapshot.put_async(use_cache=False)
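Applied to the loop from the question, it might look like this (a sketch; use_memcache=False is an additional assumption that may or may not fit your case):

for user_key in user_keys:
    statssnapshot = StatsSnapShot(parent=user_key,
                                  property=get_rss(user_key.id()))
    # Bypass the in-context cache (and here memcache too) so the entity
    # is not retained in memory after its async put completes.
    statssnapshot.put_async(use_cache=False, use_memcache=False)

Since the handler is decorated with @ndb.toplevel, the pending async puts are still waited on before the request finishes.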