Question

I am trying to use Berkeley DB to store a frequency table (i.e. hashtable with string keys and integer values). The table will be written, updated, and read from Python; so I am currently experimenting with bsddb3. This looks like it will do most of what I want, except it looks like it only supports string values?

If I understand correctly, Berkeley DB supports any kind of binary key and value. Is there a way to efficiently pass raw long integers in/out of Berkeley DB using bsddb3? I know I can convert the values to/from strings, and this is probably what I will end up doing, but is there a more efficient way? I.e. by storing 'raw' integers?


Background: I am currently working with a large frequency table (potentially tens, if not hundreds, of millions of keys). It is currently implemented as a Python dictionary, but I abort the script when it starts to swap into virtual memory. Yes, I looked at Redis, but it stores the entire database in memory, so I'm about to try Berkeley DB. I should be able to improve creation efficiency with short-term in-memory caching, i.e. build an in-memory Python dictionary and periodically merge it into the master Berkeley DB frequency table.


Solution

Do you need to read the data back from a language other than Python? If not, you can just pickle the Python long integers and unpickle them when you read them back in. You can probably use the shelve module, which does this for you automatically. Even if you can't, you can pickle and unpickle the values manually.

>>> import cPickle as pickle
>>> pickle.dumps(19999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999, pickle.HIGHEST_PROTOCOL)
'\x80\x02\x8a(\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x7fT\x97\x05p\x0b\x18J#\x9aA\xa5.{8=O,f\xfa\x81|\xa1\xef\xaa\xfd\xa2e\x02.'
>>> pickle.loads('\x80\x02\x8a(\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x7fT\x97\x05p\x0b\x18J#\x9aA\xa5.{8=O,f\xfa\x81|\xa1\xef\xaa\xfd\xa2e\x02.')
19999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999L
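The shelve suggestion above could look roughly like the following sketch. It is written for Python 3 and uses only the standard library, so the on-disk backend is whatever dbm module your installation provides, not necessarily Berkeley DB. The helper names flush_counts and read_count are made up for illustration.

```python
import os
import shelve
import tempfile
from collections import Counter


def flush_counts(path, counts):
    """Add the in-memory counts to the on-disk shelf at `path`."""
    with shelve.open(path) as table:
        for word, n in counts.items():
            # shelve pickles the value on write and unpickles it on read,
            # so arbitrarily large Python integers round-trip unchanged.
            table[word] = table.get(word, 0) + n


def read_count(path, word):
    """Return the stored count for `word`, or 0 if it was never seen."""
    with shelve.open(path) as table:
        return table.get(word, 0)


# Usage: build small in-memory batches, then merge each into the shelf.
demo_path = os.path.join(tempfile.mkdtemp(), "freq")
flush_counts(demo_path, Counter(["spam", "eggs", "spam"]))
flush_counts(demo_path, Counter(["spam"]))
print(read_count(demo_path, "spam"))
```

Opening and closing the shelf on every call keeps the sketch short; for a real workload you would presumably keep it open across batches.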

OTHER TIPS

Use Python's struct module to convert an integer to bytes in Python 3 (or to a string in Python 2). Depending on your data you might choose a different packing format; the example uses Q, i.e. unsigned long long (uint64_t):

import struct
struct.pack('>Q', my_integer)  # pack, not unpack: int -> 8 big-endian bytes

This returns the byte representation of my_integer in big-endian order, so the lexicographic (byte-wise) ordering that bsddb applies to keys matches numeric ordering. You can come up with a smarter packing function (have a look at wiredtiger.intpacking) to save space.
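As a minimal sketch of the round trip, assuming counts fit in an unsigned 64-bit integer (the helper names pack_count and unpack_count are hypothetical), the following packs integers to 8 big-endian bytes, unpacks them again, and checks that sorting the raw bytes gives numeric order:

```python
import struct

# '>Q' = big-endian, unsigned 64-bit; always exactly 8 bytes.
_U64 = struct.Struct(">Q")


def pack_count(n):
    """Encode 0 <= n < 2**64 as 8 bytes; raises struct.error outside that range."""
    return _U64.pack(n)


def unpack_count(raw):
    """Decode 8 bytes produced by pack_count back into an int."""
    return _U64.unpack(raw)[0]


values = [70000, 3, 256, 255]
# Sort the encoded byte strings, not the integers.
encoded = sorted(pack_count(v) for v in values)
# Byte-wise order of big-endian encodings equals numeric order.
decoded = [unpack_count(raw) for raw in encoded]
print(decoded)
```

The ordering property only matters if the integers are used as keys. For values in a frequency table any fixed byte order works, as long as reads and writes agree on it.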

You don't need a Python-side cache; use DBEnv.set_cachesize and DBEnv.set_cache_max to size Berkeley DB's own cache instead.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow