Question

I am using the following script to create some RSS snapshots.

The script runs on a backend, and I am seeing very heavy, ever-increasing memory consumption.

import logging
import time

import webapp2
from google.appengine.ext import ndb


class StartHandler(webapp2.RequestHandler):

    @ndb.toplevel
    def get(self):
        user_keys = User.query().fetch(1000, keys_only=True)
        if not user_keys:
            return
        logging.info("Starting to process users")
        successful_count = 0
        start_time = time.time()
        for user_key in user_keys:
            try:
                # get_rss() makes a urlfetch (see below)
                statssnapshot = StatsSnapShot(parent=user_key,
                                              property=get_rss(user_key.id()))
                statssnapshot.put_async()
                successful_count += 1
            except Exception:
                pass  # skip this user and keep going
        logging.info("Processed [%d] users after [%d] secs",
                     successful_count, int(time.time() - start_time))
        return

EDIT

Here is also the RSS function, let's say:

import re

from google.appengine.api import urlfetch


def get_rss(url):
    try:
        result = urlfetch.fetch(url)
        if result.status_code != 200:
            logging.warning("Invalid URLfetch")
            return
    except urlfetch.Error as e:
        logging.warning("Fetch failed to get %s with %s", url, e)
        return
    content = result.content  # around 200-500 KB
    reobj = re.compile(r'(?<=")[0-9]{21}(?=")')
    user_ids = reobj.findall(content)
    user_ids = set(user_ids)  # deduplicate the IDs
    return user_ids

The script runs OK, but as the number of users grows, it consumes more and more memory. Coming from C, I don't know how to manage memory and variables in Python that efficiently.

For example, I know that if a variable in Python is no longer referenced, the garbage collector frees the memory used for it. So what is happening in my case, and where am I doing it wrong?

How can I optimize this script so that memory usage does not keep growing, and it only consumes the memory required to process each user?


Solution

NDB adds automatic caching, which is usually very convenient: there is an in-context (in-memory) cache and memcached, and you can set policies for both.
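
For example, here is a minimal sketch of turning the caches off for the snapshot kind entirely (assuming the StatsSnapShot model from your question; its properties are not shown there, so the body is elided):

class StatsSnapShot(ndb.Model):
    # ... existing properties ...
    # NDB's default policy functions consult these class variables,
    # so entities of this kind bypass both caches.
    _use_cache = False     # in-context (in-memory) cache
    _use_memcache = False  # memcached

Alternatively, you can set a policy on the context, e.g. at the top of get():

ndb.get_context().set_cache_policy(lambda key: key.kind() != 'StatsSnapShot')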

When making a put, you can provide context options, and I suspect that the following would work for you:

statssnapshot.put_async(use_cache=False)
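
Applied to the loop from your question, it might look like this (a sketch; @ndb.toplevel already waits for outstanding puts at the end, but collecting the futures and waiting on them explicitly makes that visible):

futures = []
for user_key in user_keys:
    statssnapshot = StatsSnapShot(parent=user_key,
                                  property=get_rss(user_key.id()))
    # use_cache=False keeps each written entity out of the
    # in-context cache, so memory is not retained per user.
    futures.append(statssnapshot.put_async(use_cache=False))
ndb.Future.wait_all(futures)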
Licensed under: CC-BY-SA with attribution