Question

I have a dict from some hash key to a bitstring. The bitstring can be variable length, but are generally < 160 bits and usually <80. I have about 80M key value pairs.

How can I store this data structure in as little memory as possible? I don't want to pad the bitstrings, or I will lose quite a bit of space (no pun intended).

I assume that I will have to store a byte at the beginning giving the length of the bit string. That's okay.

What is the most memory-efficient way to store this dict in memory?

I would prefer to use Python, but am open to other choices.

Was it helpful?

Solution

If you're referring to padding the bitstrings to an integral number of bytes, then you could store the concatenation of all the initial bitstrings in a single bitstring, and keep a dictionary whose values are tuples of the form (bit position, length).

Problem is, if I did the math right, then the length of this larger bitstring can be over 12 billion bits, so the bit position would be need a couple of bits over an int. Then if you need to pad the bit position itself we're back to square one.

If, however, the number of distinct lengths is < 64, you could fit the length field in 6 bits, and you end up with a dictionary mapping hash keys to 5-byte tuples indexing into the single bitstring. Would this work for you?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top