Question

The Overview

I am using the low-level calls in the libbzip2 library: BZ2_bzCompressInit(), BZ2_bzCompress() and BZ2_bzCompressEnd() to compress chunks of data to standard output.

I am migrating working code away from the higher-level calls because I have a stream of bytes coming in, and I want to compress those bytes in sets of discrete chunks (a discrete chunk is a set of bytes containing a group of tokens of interest; my input is logically divided into groups of these chunks).

A complete group of chunks might contain, say, 500 chunks, which I want to compress to one bzip2 stream and write to standard output.

Within one such group, using the pseudocode I outline below, if my example buffer is able to hold 101 chunks at a time, I would open a new stream, compress the 500 chunks in runs of 101, 101, 101, 101, and one final run of 96 chunks that closes the stream.

The Problem

The issue is that my bz_stream structure instance, which keeps track of the number of compressed bytes in a single pass of the BZ2_bzCompress() routine, appears to report more compressed bytes than the total number of bytes in the final, compressed file.

For example, the compressed output could be a file with a true size of 1234 bytes, while the number of reported compressed bytes (which I track while debugging) is somewhat higher than 1234 bytes (say 2345 bytes).

My rough pseudocode is in two parts.

The first part is a rough sketch of what I do to compress a subset of chunks (and I know that I have another subset coming after this one):

bz_stream bzStream;
unsigned char bzBuffer[BZIP2_BUFFER_MAX_LENGTH] = {0};
unsigned long bzBytesWritten = 0UL;
unsigned long long cumulativeBytesWritten = 0ULL;
unsigned char myBuffer[UNCOMPRESSED_MAX_LENGTH] = {0};
size_t myBufferLength = 0;

/* initialize bzStream */
bzStream.next_in = NULL;
bzStream.avail_in = 0U;
bzStream.avail_out = 0U;
bzStream.bzalloc = NULL;
bzStream.bzfree = NULL;
bzStream.opaque = NULL;
int bzError = BZ2_bzCompressInit(&bzStream, 9, 0, 0); 

/* bzError checking... */

do
{
    /* read some bytes into myBuffer... */

    /* compress bytes in myBuffer */
    bzStream.next_in = myBuffer;
    bzStream.avail_in = myBufferLength;
    bzStream.next_out = bzBuffer;
    bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
    do 
    {
        bzStream.next_out = bzBuffer;
        bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
        bzError = BZ2_bzCompress(&bzStream, BZ_RUN);

        /* error checking... */

        bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
        cumulativeBytesWritten += bzBytesWritten;

        /* write compressed data in bzBuffer to standard output */
        fwrite(bzBuffer, 1, bzBytesWritten, stdout);
        fflush(stdout);
    } 
    while (bzError == BZ_OK);
} 
while (/* while there is a non-final myBuffer full of discrete chunks left to compress... */);

Now we wrap up the output:

/* read the final batch of bytes into myBuffer (with a total byte size of `myBufferLength`)... */

/* compress remaining myBufferLength bytes in myBuffer */
bzStream.next_in = myBuffer;
bzStream.avail_in = myBufferLength;
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
do 
{
    bzStream.next_out = bzBuffer;
    bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
    bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);

    /* bzError error checking... */

    /* increment cumulativeBytesWritten by `bz_stream` struct `total_out_*` members */
    bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
    cumulativeBytesWritten += bzBytesWritten;

    /* write compressed data in bzBuffer to standard output */
    fwrite(bzBuffer, 1, bzBytesWritten, stdout);
    fflush(stdout);
} 
while (bzError != BZ_STREAM_END);

/* close stream */
bzError = BZ2_bzCompressEnd(&bzStream);

/* bzError checking... */

The Questions

  • Am I calculating cumulativeBytesWritten (or, specifically, bzBytesWritten) incorrectly, and how would I fix that?

I have been tracking these values in a debug build, and I do not seem to be "double counting" the bzBytesWritten value. This value is counted and used once to increment cumulativeBytesWritten after each successful BZ2_bzCompress() pass.

  • Alternatively, am I not understanding the correct use of the bz_stream state flags?

For example, does the following compress and keep the bzip2 stream open, so long as I keep sending some bytes?

bzError = BZ2_bzCompress(&bzStream, BZ_RUN);

Likewise, does the following statement compress data so long as at least some bytes are available at the bzStream.next_in pointer (BZ_RUN), and then wrap up the stream once no more bytes are available (BZ_FINISH)?

bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);

  • Or, am I not using these low-level calls correctly at all? Should I go back to using the higher-level calls to continuously append a grouping of compressed chunks of data to one main file?

There's probably a simple solution to this, but I've been banging my head on the table for a couple days in the course of debugging what could be wrong, and I'm not making much progress. Thank you for any advice.


Solution

In answer to my own question, it appears I was miscalculating the number of bytes written. I should not use the total_out_* members, which hold the cumulative totals for the entire stream rather than the output of a single BZ2_bzCompress() pass. The following correction, which measures how much of the output buffer each pass actually filled, works properly:

bzBytesWritten = sizeof(bzBuffer) - bzStream.avail_out;

The rest of the calculations then follow correctly.
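
For completeness, here is a rough sketch of how the corrected inner loop might look, reusing the names from my pseudocode above; the `action` variable is hypothetical (BZ_RUN while further batches are coming, BZ_FINISH for the final one), and the per-pass byte count is taken from how much of the output window BZ2_bzCompress() filled:

/* sketch only: assumes bzStream, bzBuffer, BZIP2_BUFFER_MAX_LENGTH, myBuffer,
   myBufferLength and cumulativeBytesWritten are set up as in the question */
int action = BZ_RUN;   /* hypothetical: BZ_RUN for an intermediate batch, BZ_FINISH for the last */
size_t passBytes = 0;  /* compressed bytes produced by one pass only */

bzStream.next_in = (char *) myBuffer;
bzStream.avail_in = myBufferLength;
do
{
    /* give BZ2_bzCompress() a fresh output window on every pass */
    bzStream.next_out = (char *) bzBuffer;
    bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;

    bzError = BZ2_bzCompress(&bzStream, action);

    /* bzError checking (expect BZ_RUN_OK, BZ_FINISH_OK or BZ_STREAM_END)... */

    /* bytes produced in THIS pass: the part of the window that was consumed */
    passBytes = BZIP2_BUFFER_MAX_LENGTH - bzStream.avail_out;
    cumulativeBytesWritten += passBytes;

    /* write only the bytes from this pass */
    if (passBytes > 0)
    {
        fwrite(bzBuffer, 1, passBytes, stdout);
        fflush(stdout);
    }
}
while ((action == BZ_RUN && bzStream.avail_in > 0) ||
       (action == BZ_FINISH && bzError != BZ_STREAM_END));

With this in place, total_out_lo32 and total_out_hi32 remain useful as a cross-check: read once after BZ2_bzCompress() has returned BZ_STREAM_END, they give the cumulative size of the whole compressed stream, which should match cumulativeBytesWritten.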
