How to avoid having to store extra information about padding (for byte size alignment) with Huffman coding
-
05-11-2019 - |
Pergunta
When storing the encoded Huffman bit stream in bytes, in general, either
- the final byte gets padded or
- a pseudo end-of-file symbol gets used
In the former case, the number of bits padded needs to be stored somewhere, requiring another 3 bits. In the latter, you will lose some efficiency due to the additional pseudo symbol.
So how can I avoid the extra cost of a pseudo symbol, and also avoiding the need to store/transmit extra bits to hold the number of pad bits?
N.B. On the one hand, I'm interested in compressing short strings, so saving 3 bits on average can shave off another 1% or so of the data.
But more importantly, getting rid of the need to send the padding info (which is only known after compressing the whole string) before the encoded data means I can more easily perform the algorithm in a streaming fashion. This is not really necessary for short strings, where everything may be kept in memory, but sometimes my strings actually do get much longer, and I don't want to have to keep that in memory (considering there are hundreds of simultaneous encodings going on).
Nenhuma solução correta