Question

I want to store a large dictionary using shelve. More specifically, what I have is rather a dictionary of dictionaries, like a matrix:

dict[key1][key2] = value

The number of keys is around 5.000, which makes a matrix of 5.000x5.000 elements, impossible to store in my 4Gb of memory. For that reason I thought shelve may be a good solution.

Nevertheless, I haven't found any documentation about how to build a dictionary of dictionaries efficiently.

So far, what I have done is to use shelve to create a dictionary that contains regular dictionaries:

def create_entry(key1,key2,value,shelf): # shelf is the opened shelve file
   if key1 not in shelf: # first time we see this key, initialise it
    shelf[key1]={}
   # add the contents
   shelf[key1][key2]=value

That works, but it seems to be that there should be a better way to do that taking full advantage of shelve. Any ideas?

Was it helpful?

Solution

The only real problem with your solution is that a shelf stores values by pickling them, so each of your second-level dicts has to be pickled and unpickled every time it's paged off to disk, which could have a significant performance cost. (Also, of course, each second-level dict has to fit in memory, but that's probably not an issue.)

If that doesn't matter, what you're doing is fine. But if it does, there are two other options.

You could use some more complex database than shelve and build your own wrapper around it.

Or, far more simple: Just use a single-level dict with pairs as keys:

def create_entry(key1, key2, value, shelf):
    shelf[(key1, key2)] = value

The downside with this is that if you need to, e.g., iterate all of the keys or values in the key1 subdictionary, you'll have to iterate the entire shelf. It's not exactly complicated ((key2 for key1, key2 in shelf if key1 == key)), but it will be a lot slower.


The code above doesn't actually work directly for shelve, because the keys have to be strings. But there are pretty easy ways to wrap that up; just as shelve pickles and unpickles values, you can write a wrapper than stringifies and parses keys—e.g.:

def _mapkey(self, key):
    return ','.join(map(str, key))
def _unmapkey(self, key):
    return tuple(map(int, key.split(',')))
def __getitem__(self, key):
    return self.shelf[self._mapkey(key)]
def __iter__(self):
    return map(self._unmapkey, self.shelf)
# etc.

For efficiency, it might be better to fork or subclass shelve so you can just generate a bytes for the underlying dbm key, instead of generating a str so that shelve can encode it. Then you could do this:

def _mapkey(self, key):
    return struct.pack('>II', key)
def _unmapkey(self, key):
    return struct.unpack('<II', key)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top