Store/retrieve a data structure

https://stackoverflow.com/questions/8300364

22-10-2019
|

문제

I have implemented a suffix tree in Python to make full-text-searchs, and it's working really well. But there's a problem: the indexed text can be very big, so we won't be able to have the whole structure in RAM.

enter image description here

IMAGE: Suffix tree for the word BANANAS (in my scenario, imagine a tree 100000 times bigger).

So, researching a little bit about it I found the pickle module, a great Python module for "loading" and "dumping" objects from/into files, and guess what? It works wonderfully with my data structure.

So, making the long story shorter: What would be the best strategy to store and retrieve this structure on/from disk? I mean, a solution could be to store each node in a file and load it from disk whenever is needed, but this isn't the best think to do (too many disk accesses).

Footnote: Although I have tagged this question as python, the programming language isn't the important part of the question, the disk storing/retrieving strategy is really the main point.

해결책

If pickle is already working for you, you may want to take a look at ZODB which adds some functionality on top of pickle. Looking at the documentation, I saw this paragraph that looks to address the size concerns you're having:

The database moves objects freely between memory and storage. If an object has not been used in a while, it may be released and its contents loaded from storage the next time it is used.

다른 팁

An effective way to manage a structure like this is to use a memory-mapped file. In the file, instead of storing references for the node pointers, you store offsets into the file. You can still use pickle to serialise the node data to a stream suitable for storing on disk, but you will want to avoid storing references since the pickle module will want to embed your entire tree all at once (as you've seen).

Using the mmap module, you can map the file into address space and read it just like a huge sequence of bytes. The OS takes care of actually reading from the file and managing file buffers and all the details.

You might store the first node at the start of the file, and have offsets that point to the next node(s). Reading the next node is just a matter of reading from the correct offset in the file.

The advantage of memory-mapped files is that they aren't loaded into memory all at once, but only read from disk when needed. I've done this (on a 64-bit OS) with a 30 GB file on a machine with only 4 GB of RAM installed, and it worked fine.

What about storing it in sqlite?

SQLite:

has support for up to 2 terabytes of data,
supports SQL queries,
is great replacement for storing in-app data,
can support ~100k visits per day (if used for average web application),

So it may be good to take a closer look at this solution, as it has proven to be the efficient way to store data within applications.

Maybe you could combine cPickle and a bsddb database that will allow you to store your pickled nodes in a dictionnary-like object that will be stored on the filesystem; thus you could load the database later and get from the nodes you need with really good performances.

In a very simple way :

import bsddb
import cPickle

db = bsddb.btopen('/tmp/nodes.db', 'c')
def store_node(node, key, db):
    db[key] = cPickle.dumps(node)

def get_node(key, db):
    return cPickle.loads(db[key])

Try a compressed suffix tree instead.

The main idea is that instead of having lots of nodes of 1 character, you can compact them into 1 node of multiple characters thus saving extra nodes.

This link here (http://www.cs.sunysb.edu/~algorith/implement/suds/implement.shtml) says you can transform a 160MB suffix tree to 33MB compressed suffix tree. Quite a gain.

These compressed trees are used for genetic substring matching on huge strings. I used to run out of memory with a suffix tree, but after I compressed it, the out of memory error disappeared.

I wish I could find an unpaid article which explains the implementation better. (http://dl.acm.org/citation.cfm?id=1768593)

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow