Get size of uncompressed data in zlib?

https://stackoverflow.com/questions/929757

06-09-2019

Question

I'm creating something that includes a file upload service of sorts, and I need to store data compressed with zlib's compress() function. I send it across the internet already compressed, but I need to know the uncompressed file size on the remote server. Is there any way I can figure out this information without uncompress()ing the data on the server first, just for efficiency? That's how I'm doing it now, but if there's a shortcut I'd love to take it.

By the way, why is it called uncompress? That sounds pretty terrible to me, I always thought it would be decompress...

Solution

The zlib format has no field for the original input size, so I doubt you can get it without at least simulating a decompression of the data. The gzip format does have an "input size" (ISIZE) field that you could use (it stores the size modulo 2^32), but maybe you want to avoid changing the compression format or having the clients send the file size.
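
If switching to gzip is an option, reading ISIZE is cheap, because it is simply the last four bytes of the stream, stored little-endian. A minimal sketch, assuming the whole gzip stream is already in memory; gzip_isize is a made-up helper name, not a zlib function. The value is only the size modulo 2^32, covers only the last member of a multi-member file, and is whatever the client chose to write there.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of zlib): read the ISIZE trailer field of a
 * gzip stream held in memory. Returns 0 on success, -1 if the buffer is too
 * short to be a gzip stream at all. */
int gzip_isize(const uint8_t *buf, size_t len, uint32_t *out)
{
    const uint8_t *p;

    if (buf == NULL || out == NULL || len < 18)  /* 10-byte header + 8-byte trailer */
        return -1;

    p = buf + len - 4;                           /* ISIZE is the last 4 bytes */
    *out = (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);               /* stored little-endian */
    return 0;
}
```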

But even if you use a different format, if you don't trust the clients you would still need to run the more expensive check to make sure the uncompressed data really is the size the client claims. In that case, what you can do is make the uncompress-to-/dev/null process less expensive by making sure zlib doesn't keep the output data anywhere, since all you want to know is the uncompressed size.

OTHER TIPS

I doubt it. As far as I remember, this isn't something the underlying zlib library provides (it's been a good 7 or 8 years since I used it, but the up-to-date docs don't seem to indicate this feature has been added).

One possibility would be to transfer another file which contained the uncompressed size (e.g., transfer both file.zip and file.zip.size) but that seems fraught with danger, especially if you get the size wrong.

Another alternative, if uncompressing on the server is expensive but doesn't have to be done immediately, is to do it in a lower-priority background task (for example with nice under Linux). But again, there may be drawbacks if the size checker starts falling behind (too many uploads coming in).

And I tend to think of decompression in terms of "explosive decompression", not a good term to use :-)

If you're uploading using the raw 'compress' format, then you won't have information on the size of the data that's being uploaded. Pax is correct in this regard.
You can store it as a 4-byte header at the start of the compression buffer, assuming the file size doesn't exceed 4GB.
Some C code as an example:

 /* needs <stdint.h>, <stdlib.h>, <string.h> and <zlib.h> */
 uLongf compressedSize = compressBound(bufsize);  /* worst-case output size */
 uint8_t *compressBuffer = malloc(sizeof (uint32_t) + compressedSize);
 uint32_t sizeHeader = (uint32_t)filesize;        /* 4-byte header, host byte order */
 memcpy(compressBuffer, &sizeHeader, sizeof sizeHeader);
 compress(compressBuffer + sizeof sizeHeader, &compressedSize, sourceBuffer, bufsize);

Then you send the complete compressBuffer, which is compressedSize + sizeof (uint32_t) bytes long. When you receive it on the server side you can use the following code to get the data back:

 // data is in compressBuffer, assume you already know compressed size.
 uint32_t originalSize;
 memcpy(&originalSize, compressBuffer, sizeof originalSize);  // same byte order as the sender assumed
 uint8_t *realCompressBuffer = compressBuffer + sizeof originalSize;
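
If client and server might differ in endianness or in the width of their integer types, the header can be written byte by byte instead. A small sketch, assuming a big-endian 4-byte header; put_size_be32 and get_size_be32 are illustrative names, not zlib functions.

```c
#include <stdint.h>

/* Hypothetical helpers: store and load the size header as exactly four
 * bytes in big-endian order, independent of the host CPU. */
static void put_size_be32(uint8_t out[4], uint32_t size)
{
    out[0] = (uint8_t)(size >> 24);
    out[1] = (uint8_t)(size >> 16);
    out[2] = (uint8_t)(size >> 8);
    out[3] = (uint8_t)size;
}

static uint32_t get_size_be32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24)
         | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)
         |  (uint32_t)in[3];
}
```

The client would call put_size_be32(compressBuffer, filesize) before compressing into compressBuffer + 4, and the server would call get_size_be32(compressBuffer) before touching the payload.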

If you don't trust the client to send the correct size then you will need to perform some sort of uncompressed data check on the server side. The suggestion of using uncompress to /dev/null is a reasonable one.

If you're uploading a .zip file, it contains a directory which tells you the size of each file when it's uncompressed. This information is built into the file format, though again it is at the mercy of malicious clients.

Licensed under: CC-BY-SA with attribution