Question

I have written code that takes input from a file of very big data, performs some simple processing on it, and then stores it in a shelve dictionary. I have 41 million entries to process. However, after I write about 35 million entries to the shelve dict, performance suddenly drops and eventually the process halts completely. Any idea what I can do to avoid this?

My data comes from Twitter and maps user screen names to their IDs, like so:

Jack 12
Mary 13
Bob 15

I need to access each of these by name very quickly: e.g. my_dict['Jack'] should return 12.
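
Schematically, my write loop is something like this simplified sketch (the file name and the processing step here are placeholders):

    import shelve

    names = shelve.open("screen_names.shelve")
    with open("twitter_dump.txt") as f:          # the real input file is much bigger
        for line in f:
            screen_name, user_id = line.split()
            names[screen_name] = int(user_id)    # simple processing omitted
    names.close()

    # later: fast lookup by screen name
    names = shelve.open("screen_names.shelve")
    print(names["Jack"])   # 12
    names.close()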

Solution

Consider using something more low-level. shelve performance can be quite poor, unfortunately. That doesn't explain the slow-down you are seeing, though.

For many disk-based indexes it helps if you can initialize them with an expected size, so they do not need to reorganize themselves on the fly. I've seen this have a huge performance impact for on-disk hash tables in various libraries.

As for your actual goal, have a look at:

http://docs.python.org/library/persistence.html

in particular the gdbm, dbhash, bsddb, dumbdbm and sqlite3 modules.

sqlite3 is probably not the fastest, but it is the easiest to use. After all, it has a command-line SQL client. bsddb is probably faster, in particular if you tune nelem and similar parameters for your data size. And it has a lot of language bindings, too; likely even more than sqlite.
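
To illustrate the sqlite3 route, a minimal sketch might look like the following; the file names, the table layout, and the assumption of a whitespace-separated input file are mine, not something given in the question:

    import sqlite3

    conn = sqlite3.connect("names.db")
    conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT PRIMARY KEY, id INTEGER)")

    def rows(path):
        # yield (name, id) pairs from a whitespace-separated dump file
        with open(path) as f:
            for line in f:
                name, uid = line.split()
                yield name, int(uid)

    # executemany inside a single transaction keeps millions of inserts fast
    with conn:
        conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?)", rows("names.txt"))

    # lookups go through the primary-key index
    cur = conn.execute("SELECT id FROM users WHERE name = ?", ("Jack",))
    print(cur.fetchone()[0])   # 12
    conn.close()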

Try to create your database with an initial size of 41 million, so it can optimize for this size!
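
For the bsddb route, that size hint is the set_h_nelem() call on a hash database. This is a rough sketch assuming the bsddb3 / pybsddb bindings (Python 2's bundled bsddb exposes the same db submodule), with a made-up file name:

    from bsddb3 import db   # on Python 2: from bsddb import db

    d = db.DB()
    d.set_h_nelem(41000000)   # hint: expected number of keys; set before open()
    d.open("names_hash.db", dbtype=db.DB_HASH, flags=db.DB_CREATE)

    d.put(b"Jack", b"12")     # keys and values are byte strings
    print(d.get(b"Jack"))     # b'12'
    d.close()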

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow