Question

I know that Python has its own memory management implementation, using arenas for objects of different sizes and much more, although I haven't found thorough documentation yet. Still, I'd like to understand what's happening under the hood.

The background is a long-running Python 2 database application that somehow appears to leak memory; it is running on 64-bit Linux. Every day this application reads some data from the DB, which adds up to ~3.5GB of RAM usage just for reading the rows (using MySQLdb). There are about 3.5M rows, which are afterwards reduced to a few hundred rows; the rest goes out of scope ("freed").

But Python 2.7 only frees a small fraction of the now "unused" memory. I'm aware that the memory is reused later, but I have observed that this memory somehow seems to "slowly leak". The mentioned DB application reads this huge chunk of data every day. Reading it twice (or more times) in a row only allocates memory for the first read and then apparently reuses this memory. But letting it run for a couple of hours and then reading the DB data again produces the next 3+GB peak of memory allocation (which, again, is never freed).

To add some more background (and make things harder to explain): this DB application is not idle but permanently performs tasks. I am pretty sure from monitoring the memory usage (Nagios performance data) that memory usage never climbs to 3.5GB RAM (or even close) without this particular DB query. But having this query enabled adds 3+GB of RAM every day. The query in question returns mostly unique integers and floats.

This is the main reason why I started to suspect Python in general. I feel like I've read tons of information and looked at _PyObject_DebugMallocStats(), but I have no clue what (or why) Python decides to keep a couple of gigabytes.
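(For what it's worth, newer CPython builds expose those allocator statistics from Python via sys._debugmallocstats(), which wraps the same _PyObject_DebugMallocStats() call; below is a minimal, guarded sketch — the availability check is there because I'm not sure every 2.7 build ships it.)

import sys

# Dump pymalloc arena/pool/block statistics to stderr, if this build exposes the hook.
if hasattr(sys, "_debugmallocstats"):
    sys._debugmallocstats()
else:
    print("sys._debugmallocstats() is not available in this interpreter build")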

It boils down to a very simple example (not representative of the real-life situation regarding the data; I know about xrange()):

def mem_usage(pid=None):
    """Sum the Private_Clean and Private_Dirty fields of /proc/<pid>/smaps, in kB."""
    mem = 0
    proc = str(pid or "self")
    with open("/proc/%s/smaps" % proc) as fstat:
        for line in fstat:
            if not line.startswith("Private_"):
                continue
            # lines look like "Private_Dirty:      1234 kB"
            mem += int(line.split(":", 1)[1].strip().split(" ", 1)[0])
    return mem

mem_usage()                 # reports a few MB
x = list(range(100000000))  # use list() for py3k
mem_usage()                 # reports ~3GB
del x
mem_usage()                 # reports ~2.5GB

What's interesting is that Python 3 frees memory when I delete the huge list. Not only a fraction, but almost all of it, leaving memory usage only slightly higher than at the beginning.

I've investigated this with memory_profiler (I guess it is not doing much more than the given mem_usage() function) without any insight. I've read about gdb-heap but couldn't get it working so far.
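(For reference, the memory_profiler check boils down to something like the sketch below, assuming the package is installed; memory_usage(-1, ...) samples the resident memory of the current process in MiB, which is essentially the same information as mem_usage() above.)

from memory_profiler import memory_usage

# Sample the current process's resident memory (in MiB) every 0.5s for ~2s.
samples = memory_usage(-1, interval=0.5, timeout=2)
print("RSS samples in MiB:", samples)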

I actually do not believe that there is a solution (other than restarting the application or reducing the amount of data read from the DB). But I'd really appreciate any insights on this topic.

EDIT:

To summarize my question: why is Python 2.7 keeping this memory allocated?


Solution

The range example keeps a ton of memory around because Python 2.7 never frees ints:

block_list is a singly-linked list of all PyIntBlocks ever allocated, linked via their next members. PyIntBlocks are never returned to the system before shutdown (PyInt_Fini).

However, this should not be a problem unless, at some point, several gigabytes' worth of ints are alive at the same time. Otherwise, Python will use old, discarded ints to represent any new ones you use. If you do have several gigabytes' worth of live ints, I recommend finding a way to keep fewer of them around at a time.
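To illustrate the reuse part, here is a sketch built on the mem_usage() helper from the question (the expected numbers are an assumption, not a measurement): allocating the list a second time after the del should not add another ~3GB on top of the first peak, because the new int objects are carved out of PyIntBlocks that are still sitting on the free list.

x = list(range(100000000))      # first allocation: fresh PyIntBlocks are grabbed from the OS
first_peak = mem_usage()

del x                           # the ints go back on the int free list, not to the OS

x = list(range(100000000))      # second allocation: recycles the discarded ints
second_peak = mem_usage()

print(first_peak, second_peak)  # expected to be roughly equal, not double

As for keeping fewer live ints around, one option in this setup might be to iterate over the rows with a server-side cursor (e.g. MySQLdb's SSCursor) and reduce them on the fly, rather than fetching all 3.5M rows into memory first.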

Licensed under: CC-BY-SA with attribution