Question

I'm running into a bit of trouble with some code. Please bear in mind that I'm a terrible programmer, so my solution probably isn't very elegant (and is likely the reason why I'm running out of memory - I have 4 gigabytes of RAM and the script slowly fills it).

Here's the problem. I've got about 3,500 files in a directory. Each file consists of a single line that can contain relatively few or many characters with no spaces (the smallest file is 200 bytes, the largest 1.3 megabytes). What I am trying to do is find common substrings of a set length (13 characters in the code below) between these files, two files at a time. I take them two at a time because I'm not looking for a substring common to all of the files; rather, I want to check every combination of two files until all pairs have been compared. I.e., any common substring of the set length between a pair of files, not a substring shared by all of them.
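
To make the criterion concrete, here's a toy illustration (not the code I'm actually using, just the idea; the strings are made up):

# Two strings "match" if they share any substring of the set length
# (13 characters here), regardless of where it occurs in each string.
def share_substring(s1, s2, k=13):
    windows = set(s1[i:i+k] for i in xrange(len(s1) - k + 1))
    return any(s2[i:i+k] in windows for i in xrange(len(s2) - k + 1))

print share_substring("aaaGATTACAGATTACAaaa", "cccGATTACAGATTACAccc")  # True
print share_substring("aaaaaaaaaaaaa", "bbbbbbbbbbbbb")                # False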

I use a suffix tree module that wraps a C implementation (over here). First I make a list of all the files in the directory, then I generate every combination of two files so that all pairs are covered, and I pass each pair to the suffix tree and look for sequences that are common substrings.

However, I don't really understand why it slowly runs out of memory. I'm hoping there's a change we can make to the code so that it frees the memory it no longer needs. Obviously 3,500 files will take a long time to process, but I hope it can be done without incrementally filling 4 gigabytes of memory. Any help would be greatly appreciated! Here is the code I've got so far:

from suffix_tree import GeneralisedSuffixTree
from itertools import combinations
import glob, hashlib, os

alist = open('tmp.adjlist', 'w')

def read_f(f):
    f = open(f, "r")
    s = str(f.readlines())
    f.close()
    return s

def read_gs(a,b):
    s1 = read_f(a)
    s2 = read_f(b)
    print str(a) + ":" + str(hashlib.md5(s1).hexdigest()) + " --- " + str(b) + ":" + str(hashlib.md5(s2).hexdigest())
    return [s1,s2]

def build_tree(s):
    hlist = []

    stree = GeneralisedSuffixTree(s)

    for shared in stree.sharedSubstrings(13):
        for seq,start,stop in shared:
            hlist.append(hashlib.md5(stree.sequences[seq]).hexdigest())
        hlist = list(set(hlist))
        for h in hlist:
            alist.write(str(h) + " ")
        alist.write('\n')

glist = []

for g in glob.glob("*.g"):
    glist.append(g)

for a,b in list(combinations(glist, 2)):
    s = read_gs(a,b)
    build_tree(s)

alist.close()

os.system("uniq tmp.adjlist network.adjlist && rm tmp.adjlist")

UPDATE #1

Here's the updated code, with the suggestions Pyrce made added in. However, after jogojapan identified the memory leak in the C code, and given that fixing it is way outside my expertise, I ended up going with a much slower approach. If anyone is knowledgeable in this area, I'd be really curious to see how the C code could be modified to fix the memory leak or to add the missing deallocation function, as I think a C suffix tree binding for Python is very valuable. Without a suffix tree it will probably take a few days to run the data through this script, so I'm definitely open to seeing if anyone has a creative fix!

from itertools import combinations
import glob, hashlib, os

def read_f(f):
    with open(f, "r") as openf:
        s = str(openf.readlines())
    return s

def read_gs(a,b):
    s1 = read_f(a)
    s2 = read_f(b)
    print str(a) + ":" + str(hashlib.md5(s1).hexdigest()) + " --- " + str(b) + ":" + str(hashlib.md5(s2).hexdigest())
    return [s1,s2]

def lcs(S1, S2):
    M = [[0]*(1+len(S2)) for i in xrange(1+len(S1))]
    longest, x_longest = 0, 0
    for x in xrange(1,1+len(S1)):
        for y in xrange(1,1+len(S2)):
            if S1[x-1] == S2[y-1]:
                M[x][y] = M[x-1][y-1] + 1
                if M[x][y]>longest:
                    longest = M[x][y]
                    x_longest  = x
            else:
                M[x][y] = 0
    return S1[x_longest-longest: x_longest]

glist = glob.glob("*.g")

for a,b in combinations(glist, 2):
    s = read_gs(a,b)
    p = lcs(s[0],s[1])
    if p != "" and len(p) >= 13:
        with open("tmp.adjlist", "a") as openf:
            openf.write(hashlib.md5(s[1]).hexdigest() + " " + hashlib.md5(s[0]).hexdigest() + "\n")

os.system("uniq tmp.adjlist network.adjlist && rm tmp.adjlist")

Solution

I am reasonably confident that there is a memory leak inside the suffix tree package you use.

Evidence 1: Running the Python interpreter inside valgrind and then executing this simple Python script

from suffix_trees import SuffixTree
t = SuffixTree("mississippi")
t = None

reported this leak:

==8800== 1,413 (32 direct, 1,381 indirect) bytes in 1 blocks are definitely lost in loss record 1,265 of 1,374
==8800==    at 0x4A0884D: malloc (vg_replace_malloc.c:263)
==8800==    by 0xBE70AEC: make_helper (suffix_tree.c:193)
==8800==    by 0xBE704B2: SuffixTree_init (python_bindings.c:240)
==8800==    by 0x3F98C9867B: ??? (in /usr/lib64/libpython2.7.so.1.0)
==8800==    by 0x3F98C49A7D: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0)
==8800==    by 0x3F98CD71D6: PyEval_CallObjectWithKeywords (in /usr/lib64/libpython2.7.so.1.0)
==8800==    by 0x3F98C5EB45: ??? (in /usr/lib64/libpython2.7.so.1.0)
==8800==    by 0x3F98C49A7D: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0)
==8800==    by 0x3F98CD93F2: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==8800==    by 0x3F98CDDB2E: PyEval_EvalCodeEx (in /usr/lib64/libpython2.7.so.1.0)
==8800==    by 0x3F98C6D7B5: ??? (in /usr/lib64/libpython2.7.so.1.0)
==8800==    by 0x3F98C49A7D: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0)

Evidence 2: When you look at the code in suffix_tree.c and python_bindings.c, you'll find the make_helper() function, which allocates memory for a suffix tree (using malloc), but there is not a single free anywhere in the code. I looked specifically at the allocation and deallocation functions defined for the Python types in python_bindings.c, but could not find any deallocation function for the tree object. There is one for the node objects, but that only deallocates the Python wrapper part of the object, not the underlying structure in C.

Evidence 3: The data type definition for the Python object in python_bindings.c has a telling comment attached to it:

/* FIXME: deallocation of this guy! */
static PyTypeObject SuffixTreeType = {
    PyObject_HEAD_INIT(NULL)
    .tp_name      = "_suffix_tree.SuffixTree",
    /* ... */
    .tp_new       = SuffixTree_new,
};

Suggestions: You may want to contact the authors of the package and make them aware of the problem. According to the information on the package's webpage, they are already working on the issue that ideally there should be a cyclic dependency between the tree itself and the node objects it contains; that is a related problem, and probably the reason why the program does not currently perform any deallocation whatsoever.

Since you probably don't need the node-to-tree dependency for your purposes, you could also apply your own fix by adding a deallocation function to the tree object definition in python_bindings.c.
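
If patching the C code isn't an option, another way to contain the leak from pure Python is to run each pairwise comparison in a short-lived worker process, so whatever the extension leaks is handed back to the OS when the worker exits. Here is an untested sketch of that idea using the standard multiprocessing module together with the same GeneralisedSuffixTree/sharedSubstrings calls from your code (reading the files with .read() instead of readlines(), and the pool parameters, are my own assumptions):

# Untested sketch: each pairwise comparison runs in a worker process that is
# recycled after a single task (maxtasksperchild=1), so any memory leaked by
# the C extension is released when that worker process exits.
from multiprocessing import Pool
from itertools import combinations
from suffix_tree import GeneralisedSuffixTree
import glob, hashlib

def shared_hashes(pair):
    a, b = pair
    with open(a) as fa:
        s1 = fa.read()
    with open(b) as fb:
        s2 = fb.read()
    stree = GeneralisedSuffixTree([s1, s2])
    lines = []
    for shared in stree.sharedSubstrings(13):
        hset = set(hashlib.md5(stree.sequences[seq]).hexdigest()
                   for seq, start, stop in shared)
        lines.append(" ".join(hset))
    return lines

if __name__ == "__main__":
    glist = glob.glob("*.g")
    pool = Pool(processes=4, maxtasksperchild=1)
    with open("tmp.adjlist", "w") as alist:
        for lines in pool.imap_unordered(shared_hashes, combinations(glist, 2)):
            for line in lines:
                alist.write(line + "\n")
    pool.close()
    pool.join()

Raising maxtasksperchild trades holding a bit more leaked memory per worker against the cost of constantly restarting processes.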

OTHER TIPS

I'm not 100% sure the logic of your original code does exactly what you intend, and there is definitely a continuously growing list that I think you meant to reset: the variable hlist is initialized outside the loop over shared and keeps being added to, so it is never cleared between iterations. Making a set that is local to that loop fixes the problem. Additionally, you don't need to turn a set built from a list back into a list; just use a set to begin with and iterate over it. I would suggest learning more about iterable objects in Python, which almost all Python data-holding objects are. Basically, you don't need to call list() on those objects unless you need a specific ordering, and even then you usually just use sorted().

The build_tree below addresses these issues and should reduce your memory footprint significantly -- and hopefully stop it from growing and growing.

def build_tree(s):
    stree = GeneralisedSuffixTree(s)

    for shared in stree.sharedSubstrings(13):
        hset = set()
        for seq,start,stop in shared:
            hset.add(hashlib.md5(stree.sequences[seq]).hexdigest())

        for h in hset:
            alist.write(str(h) + " ")
        alist.write('\n')
        # Request cleanup on finished data
        del hset
    # Request cleanup on finished data
    del stree

Also use the 'with' keyword when working with files; it guarantees the file will be closed when you're done, whereas a manual open/close pair can leave the file open and break things later on if an exception is thrown in between.

def read_f(f):
    with open(f, "r") as openf:
        s = str(openf.readlines())
    return s

And as I mentioned above, remove the list() wrappers on all of the iterables you're getting back; they are already iterable, and calling list() on them can consume a ton of memory and run time for large iterables. E.g.:

for a,b in list(combinations(glist, 2)):

should be:

for a,b in combinations(glist, 2):

and

glist = []
for g in glob.glob("*.g"):
    glist.append(g)

should just become:

glist = glob.glob("*.g")

These changes should help; let me know if you still run out of memory. Memory usage should now only grow with alist as it gets bigger, and that should flush once it gets too big. It is possible you have a memory leak in the C code as well (I haven't used that library), in which case you'll also have memory issues. But it's far more likely that your Python code was the culprit.
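
If you want to check whether memory really does keep climbing after these changes, one simple option is to print the process's peak memory after each pair; a minimal sketch using the standard resource module (Unix only), with the call placed wherever your pairwise loop is:

import resource

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux and in bytes on OS X
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# e.g. inside the pairwise loop:
# print "peak RSS so far: %.1f MB" % peak_rss_mb()

If the number keeps climbing steadily even with the fixes above, that points at the C extension rather than your Python code.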

NOTE: I did not test the suggested changes I posted here, so there may be a minor syntax error present here and there.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow