Entropy-Decoder. Extract unknown number of coded coefficients from data

Question

The idea of sequentially extracting single bits is not too bad. If done right, it may be almost as fast as any other solution.

Bit sequences at arbitrary positions in a stream of granularity g, with g=16 for a stream of (16-bit) words for instance, can be handled block-wise on blocks of size g.

To extract the bits at positions s through e (with (e - s) <= g) from a stream as a 'right-aligned' number an example implementation may be:

shift = s % g

lowerBits = data[ floor( s / g ) ] >> shift
upperBits = data[ floor( e / g ) ] << (g - shift)

bitSequence = (lowerBits | upperBits) & ( (1 << (e-s)) -1 )[*]

[*] this last term only masks out any unneeded upper bits we may have got and makes them 0 in the final result.

(careful with the endianess of your data :))

Whether this will really speed things up or not cannot be determined in general. (Depends on the data being processed, the underlying computing hardware, the compiler used &c. Note that some divisions and one modulo operation are required which might slow down the algorithm significantly.)

Extracting bits one-by-one can be done quite efficiently in the same manner. For example:

blockIndex = floor( bitPosition / g )
bitIndex = bitPosition % g
nextBit = (data[ blockIndex ] >> bitIndex) & 1

This can of course be optimized to avoid the re-calculation of blockIndex and bitIndex if and when the bitPosition is always only incremented by 1.

Another common approach is to use a variable 'mask' to extract the single bits:

mask = 1
index = 0
while ( not all bits read ) { 
  block = data[index]
  if ( mask & block != 0 ) {
    // a 1 was encountered
  } else {
    // a 0 was encountered
  }
  mask = mask << 1
  if ( mask == 0 ) {
    mask = 1
    index = index + 1
  }
}

Note how the mask is used to both mask the current bit and keep track of when to advance to the next block of data. For this to work, mask must of course be of the same width g as the data blocks.

To sum it all up:

I don't think, in a general case, the solution can be more efficient than one loop iteration per bit read and any optimizations will only slightly change the performance in one direction or the other.