Question

I'm experimenting with some ideas in which algorithms have to work on bits as their smallest unit of information. This is a modular application where the user may rearrange parts of the "pipeline" like a Unix shell pipeline. These algorithms do various things like framing, compression, decompression, error checking and correction; introducing, detecting, and removing noise; etc.

Since they work at the bit level, the algorithms may, for example, take 5 bits of input and produce 19 bits of output. The input and output sizes are rarely multiples of a byte.

Working with bit streams in memory and between threads is fine with the help of std::vector<bool>, but I have to retrieve this stream of bits from somewhere and store it somewhere, and preferably it should be possible to build actual command-line pipelines like:

prog1 < bitsource.dat | prog2 -opts | prog3 -opts > bitsink.dat

Or even:

prog1 | prog2 | ssh user@host /bin/sh -c "prog3 | prog4 > /dev/dsp"

The problem is how to serialize these bits efficiently, since the standard streams (stdin and stdout) are byte-oriented. I have to handle situations where the number of bits in the input or output is not a multiple of eight.

Currently, I have a working proof of concept that expands each bit to a byte that is either 0x30 or 0x31 (ASCII "0" or "1"). Clearly, this inflates the data by a factor of eight, consuming eight times more space and bandwidth than necessary. I would like to have these bits packed more efficiently.
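In code, that proof of concept amounts to something like the following minimal sketch (the function name is hypothetical):

#include <cstdio>
#include <vector>

// Naive serializer: each bit becomes one ASCII byte ('0' = 0x30, '1' = 0x31).
// Simple and easy to debug, but it inflates the stream eightfold.
void write_bits_ascii(const std::vector<bool>& bits, std::FILE* out) {
    for (bool b : bits)
        std::fputc(b ? '1' : '0', out);
}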

One alternative that I'm considering is a protocol that buffers the bits in the output and produces blocks consisting of a Length header followed by ceiling(Length/8) bytes of data, flushing the output whenever appropriate.
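A minimal sketch of that idea, assuming a one-byte header holding the bit count (so at most 255 bits per block) and MSB-first packing; both choices are arbitrary illustrations, not a fixed format:

#include <cstdint>
#include <cstdio>
#include <vector>

// Writes one block: a 1-byte header giving the number of bits (1..255),
// followed by ceil(length/8) payload bytes, packed MSB-first.
// The final partial byte, if any, is zero-padded on the right.
void write_block(const std::vector<bool>& bits, std::FILE* out) {
    std::fputc(static_cast<int>(bits.size()), out);  // bit count, assumed <= 255
    std::uint8_t acc = 0;
    int filled = 0;
    for (bool b : bits) {
        acc = static_cast<std::uint8_t>((acc << 1) | (b ? 1 : 0));
        if (++filled == 8) {
            std::fputc(acc, out);
            acc = 0;
            filled = 0;
        }
    }
    if (filled > 0)
        std::fputc(static_cast<std::uint8_t>(acc << (8 - filled)), out);
}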

But instead of inventing a made-up protocol, I would like to know whether anyone has had these requirements before, what your experiences were, and whether there is already some standard protocol for this (serialization of an arbitrary number of bits) that I could use. Perhaps someone has already faced this problem and is using some form of encoding that could also work in this application, which would avoid the proliferation of incompatible formats.

Solution

"a protocol that buffers the bits in the output and produces blocks consisting of a Length header followed by ceiling(Length/8) bytes of data, flushing the output whenever appropriate"

This is typical. There aren't really any alternatives that would be suitably simple.
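For completeness, here is a sketch of the consuming side of such a block format, under the same assumed layout as the question's proposal (a one-byte bit count, then ceil(count/8) payload bytes, MSB-first):

#include <cstdint>
#include <cstdio>
#include <vector>

// Reads one block: a 1-byte header giving the number of valid bits,
// then ceil(count/8) payload bytes, MSB-first. Returns false at end
// of stream or on a truncated block.
bool read_block(std::vector<bool>& bits, std::FILE* in) {
    int header = std::fgetc(in);
    if (header == EOF)
        return false;
    int count = header;
    int nbytes = (count + 7) / 8;
    bits.clear();
    for (int i = 0; i < nbytes; ++i) {
        int byte = std::fgetc(in);
        if (byte == EOF)
            return false;  // truncated stream
        for (int bit = 7; bit >= 0; --bit) {
            if (static_cast<int>(bits.size()) == count)
                break;  // ignore the zero padding in the final byte
            bits.push_back(((byte >> bit) & 1) != 0);
        }
    }
    return true;
}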

Serialization of bits -- as bits -- is rare. Bitmap indexes are about the only example that comes to mind.

The Pascal programming language encoded all strings with a length followed by the bytes of the string. You're doing a similar thing except it's bits, not bytes.

This is similar to "run-length encoding", where runs of identical values are replaced by a header (a count) followed by the data. The PackBits algorithm, for example, is a simple RLE scheme built from exactly such header-plus-data records. It works at the byte level (not the bit level), but it's essentially the same design pattern.
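To illustrate that header-plus-data pattern, here is a sketch of a PackBits-style decoder (following the usual PackBits convention: a header byte n in [0, 127] means "copy the next n+1 bytes literally", n in [-127, -1] means "repeat the next byte 1-n times", and -128 is a no-op):

#include <cstdint>
#include <vector>

// Decodes a PackBits-style stream of header-plus-data records.
std::vector<std::uint8_t> packbits_decode(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        std::int8_t n = static_cast<std::int8_t>(in[i++]);
        if (n >= 0) {
            // Literal run: copy the next n+1 bytes as-is.
            for (int k = 0; k <= n && i < in.size(); ++k)
                out.push_back(in[i++]);
        } else if (n != -128 && i < in.size()) {
            // Repeat run: the next byte occurs 1-n times (2..128).
            std::uint8_t value = in[i++];
            out.insert(out.end(), 1 - n, value);
        }
    }
    return out;
}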

Licensed under: CC-BY-SA with attribution