Question

I'm relatively new to Python and am starting to work with suffix trees. I can build them, but I'm running into a memory issue when the string gets large. I know that they can be used to work with DNA strings of size 4^10 or 4^12, but whenever I try to implement a method, I end up with a memory error.

Here is my code for generating the string and the suffix tree.

import random

def get_string(length):
    string=""
    for i in range(length):
        string += random.choice("ATGC")
    return string

word=get_string(4**4)+"$"

def suffixtree(string):
    for i in xrange(len(string)):
        if tree.has_key(string[i]):
            tree[string[i]].append([string[i+1:]][0])
        else:
            tree[string[i]]=[string[i+1:]]
    return tree

tree={}
suffixtree(word)

When I get up to around 4**8, I run into severe memory problems. I'm rather new to this so I'm sure I'm missing something with storing these things. Any advice would be greatly appreciated.

As a note: I want to do string searching to look for matching strings in a very large string. The search strings are all of size 16, so I would look for a string of size 16 within the large string, then move on to the next search string and repeat. Since I'll be doing a very large number of searches, a suffix tree was suggested.

Many thanks


Solution

As others have said already, the data structure you are building is not a suffix tree. However, the memory issues stem largely from the fact that your data structure involves a lot of explicit string copies. A call like this

string[i+1:]

creates an actual (deep) copy of the substring starting at i+1.
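
To see the difference in concrete terms, here is a quick check you can run (Python 2; the exact byte counts vary by platform and build):

import sys

s = "A" * 10**6

# The slice materialises a brand-new string object of almost the
# same size as the original...
print sys.getsizeof(s[1:])         # roughly 1 MB

# ...whereas a buffer is a small fixed-size view onto `s`.
print sys.getsizeof(buffer(s, 1))  # a few dozen bytes, regardless of length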

If you are still interested in constructing your original data structure (whatever its use may be), a good solution is to use buffers instead of string copies. Your algorithm would then look like this:

def suffixtree(string):
    N = len(string)
    for i in xrange(N):
        # buffer() creates a lightweight view into `string` instead of
        # copying the slice, so each stored suffix costs O(1) extra memory.
        if tree.has_key(string[i]):
            tree[string[i]].append(buffer(string, i+1, N))
        else:
            tree[string[i]] = [buffer(string, i+1, N)]
    return tree

I tried this embedded in the rest of your code, and confirmed that it requires significantly less than 1 GB of main memory even at a total length of 8^11 characters.

Note that this will likely be relevant even if you switch to an actual suffix tree. A correct suffix tree implementation will not store copies (not even buffers) in the tree edges; however, during tree construction you might need a lot of temporary copies of the strings. Using the buffer type for these is a very good idea to avoid putting a heavy burden on the garbage collector for all the unnecessary explicit string copies.
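
To make that concrete, here is a minimal sketch (not a full implementation) of how a proper suffix tree can store an edge as a pair of indices into the text rather than as a substring:

class Node(object):
    def __init__(self):
        # Maps the first character of an edge label to a tuple
        # (start, end, child): the label is text[start:end], stored
        # as two integers rather than a copied substring.
        self.edges = {}

# For text = "banana$", an edge labelled "ana" leading to some child
# node would be recorded as:
#     root.edges['a'] = (1, 4, child)
# so no part of "banana$" is ever duplicated.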

OTHER TIPS

This doesn't look like a tree to me. It looks like you are generating all possible suffixes, and storing them in a hashtable.

You will likely get a much smaller memory footprint if you use an actual tree. I suggest using a library implementation.

The reason you get memory problems is that for the input 'banana' you are generating {'b': ['anana$'], 'a': ['nana$', 'na$', '$'], 'n': ['ana$', 'a$']}. That isn't a tree structure. You have every possible suffix of the input created and stored in one of the lists, which takes O(n^2) storage: the suffixes total n + (n-1) + ... + 1 = n(n+1)/2 characters, already about 2 billion characters for n = 4^8. Also, for a suffix tree to work properly, you want the leaf nodes to give you index positions.

The result you want to get is {'banana$': 0, 'a': {'$': 5, 'na': {'$': 3, 'na$': 1}}, 'na': {'$': 4, 'na$': 2}}. (This is an optimized representation; a simpler approach limits us to single-character labels.)
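
For illustration, here is a minimal sketch of the simpler single-character-label variant (a suffix trie; the O(n^2) construction makes it suitable only for short strings):

def suffix_trie(text):
    # Naive construction: insert every suffix one character at a time.
    root = {}
    for i in xrange(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        # 'pos' cannot collide with the single-character edge keys.
        node['pos'] = i
    return root

print suffix_trie('banana$')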

If your memory problems lie in creating the suffix tree, are you sure you need one? You could find all matches in a single string like this:

word=get_string(4**12)+"$"

def matcher(word, match_string):
    positions = [-1]
    while 1:
        positions.append(word.find(match_string, positions[-1] + 1))
        if positions[-1] == -1:
            return positions[1:-1]

print matcher(word,'AAAAAAAAAAAA')
[13331731, 13331732, 13331733]
print matcher('AACTATAAATTTACCA','AT')
[4, 8]

My machine is pretty old, and this took 30 secs to run with the 4^12 string. I used a 12-character target so there would be some matches. Also, this solution will find overlapping results, should there be any.
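
For instance, with an overlapping pattern:

print matcher('AAAA', 'AA')
[0, 1, 2]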

Here is a suffix tree module you could try; use it like this:

import suffixtree
stree = suffixtree.SuffixTree(word)
print stree.find_substring("AAAAAAAAAAAA")

Unfortunately, my machine is too slow to test this out properly with long strings. But presumably once the suffix tree is built, the searches will be very fast, so for a large number of searches it should be a good call. Further, find_substring only returns the first match (I don't know if this is an issue; I'm sure you could adapt it easily).
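
If you do need every occurrence, one simple adaptation (a sketch, assuming find_substring returns the first match index or -1) is to let the tree answer whether the pattern occurs at all, then enumerate the remaining occurrences with str.find:

def find_all(stree, text, pattern):
    # Assumes stree.find_substring() returns the first match index,
    # or -1 when the pattern is absent.
    first = stree.find_substring(pattern)
    if first == -1:
        return []
    positions = [first]
    while 1:
        nxt = text.find(pattern, positions[-1] + 1)
        if nxt == -1:
            return positions
        positions.append(nxt)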

Update: Split the string into smaller suffix trees, thus avoiding memory problems

So if you need to do 10 million searches on a 4^12-length string, we clearly do not want to wait 9.5 years (the standard simple search I first suggested, on my slow machine...). However, we can still use suffix trees (and thus be a lot quicker) AND avoid the memory issues: split the large string into manageable chunks (which we know the machine's memory can cope with), turn each chunk into a suffix tree, search it 10 million times, then discard that chunk and move on to the next one. We also need to remember to search the overlap between consecutive chunks. I wrote some code to do this (it assumes the large string to be searched, word, is a multiple of our maximum manageable string length, max_length; you'll have to adjust the code to also check the remainder at the end if that is not the case):

def split_find(word, search_words, max_length):
    number_sub_trees = len(word) / max_length  # Python 2 integer division
    matches = {}
    for i in xrange(number_sub_trees):
        stree = suffixtree.SuffixTree(word[max_length*i:max_length*(i+1)])
        for search in search_words:
            if search not in matches:
                match = stree.find_substring(search)
                if match > -1:
                    matches[search] = match + max_length*i, i
                elif i < number_sub_trees - 1:
                    # Check the region straddling the boundary with the next
                    # chunk, which neither chunk's tree can see in full.
                    boundary = max_length * (i+1)
                    match = word[boundary - len(search):boundary + len(search)].find(search)
                    if match > -1:
                        matches[search] = match + boundary - len(search), i
    return matches

word = get_string(4**12)
search_words = ['AAAAAAAAAAAAAAAA']  # list of all the words to find matches for
max_length = 4**10  # as large as your machine can cope with (must divide len(word) evenly)
print split_find(word, search_words, max_length)

In this example I limit the maximum suffix tree length to 4^10, which needs about 700 MB. Using this code, for one 4^12-length string, 10 million searches should take around 13 hours (full searches with zero matches, so if there are matches it will be quicker). However, as part of this we need to build 100 suffix trees, which will take around 100 * 41 sec ≈ 1 hour.

So the total time to run is around 14 hours, without memory issues... a big improvement on 9.5 years. Note that I am running this on a 1.6 GHz CPU with 1 GB of RAM, so you ought to be able to do way better than this!

Licensed under: CC-BY-SA with attribution