Question

Part of my application uses a trie to chunk words together. For example, ["Summer", "in", "Los", "Angeles"] becomes ["Summer", "in", "Los Angeles"].

Now, this trie gets populated from a large database, stored locally as SQL, on application startup. It takes a long time, around 15s. I'd like to reduce the application start-up time, so I've looked at serializing the Trie. Unfortunately, pickling is too slow - slower than loading everything from the database.

Is there a faster way to serialize my trie?

Here's what the Trie class looks like:

class Trie:
    def __init__(self):
        self.values = set()
        self.children = dict()

    def insert(self, key, value):
        """Insert a (key,value) pair into the trie.  
        The key should be a list of strings.
        The value can be of arbitrary type."""
        current_node = self
        for key_part in key:
            if key_part not in current_node.children:
                current_node.children[key_part] = Trie()
            current_node = current_node.children[key_part]
        current_node.values.add(value)

    def retrieve(self, key):
        """Returns either the value stored at the key, or raises KeyError."""
        current_node = self
        for key_part in key:
            current_node = current_node.children[key_part]
        return current_node.values

Is there any way of changing it that would make it more serializable?

Was it helpful?

Solution 2

I ended up storing the trie in MongoDB.

There is a network overhead, but provided the database is on localhost it isn't too bad.

OTHER TIPS

I know I am not giving a python answer but still this might be useful:

Creating, compressing and storing a trie is really a hard task. I spend quite some time thinking about data structures for auto suggestions and to the best of my knowledge the most elegant solution is provided by Giuseppe Ottaviano and partly described in my blog article

Even though it won't make sense to implement the entire solution of Ottaviano as described in his paper in python it still might make sense for you to follow the basic idea to store the complete trie as one large block of memory and only have references to the position where to jump next.

In this way you can easily serialize this array or block of memory to the hard disk. I am not completely sure about python but I think this operation should work and also be much faster than serializing a data structure.

I know that a c implementation of Ottavianos work exists you could even use the python c bindings.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top