Question

I have a Java application that writes a RandomAccessFile to the file system. It has to be a RAF because some things are not known until the end, at which point I seek back and write some information at the start of the file.

I would like to put the file into a zip archive. I could just do this at the end, but that would involve copying all the data that has been written so far. Since these files can potentially grow very large, I would prefer an approach that does not involve copying the data.

Is there some way to get something like a "ZipRandomAccessFile", à la the ZipOutputStream available in the JDK?

It doesn't have to be JDK-only; I don't mind pulling in third-party libraries to get the job done.

Any ideas or suggestions?

Solution

Maybe you need to change the file format so it can be written sequentially.

In fact, since it is a zip archive and a zip archive can contain multiple entries, you could write the sequential data to one ZipEntry and the data known only at completion to a separate ZipEntry, which gives you the best of both worlds.

It is easy to write: you never have to go back to the beginning of the large sequential chunk. It is also easy to read: if the consumer needs the 'header' data before reading the larger resource, it can read that zip entry first and then proceed.
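
Here is a minimal sketch of that layout, using the JDK's ZipOutputStream; the entry names body.dat and header.dat and the sample records are made up for illustration:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class TwoEntryZip {
    public static void main(String[] args) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("archive.zip"))) {
            // First entry: the large body, written strictly sequentially,
            // accumulating whatever totals the header will eventually need.
            zos.putNextEntry(new ZipEntry("body.dat"));
            long recordCount = 0;
            for (int i = 0; i < 1_000; i++) {
                zos.write(("record " + i + "\n").getBytes(StandardCharsets.UTF_8));
                recordCount++;
            }
            zos.closeEntry();

            // Second entry: the values that are only known at the end.
            // Instead of seeking back into the body, they get their own entry.
            zos.putNextEntry(new ZipEntry("header.dat"));
            zos.write(("recordCount=" + recordCount + "\n").getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
    }
}
```

Since ZipFile locates entries through the archive's central directory, a consumer can open the header.dat entry first regardless of the order in which the entries were written.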

OTHER TIPS

The way the DEFLATE format is specified, it only makes sense when read from the start. So every time you seek, the underlying zip implementation would have to decompress the file from the beginning again. And if you modify something, the whole file would have to be decompressed first (not just up to the modification point), the change applied to the decompressed data, and then the whole thing compressed again.

To sum it up, ZIP/DEFLATE isn't the format for this. However, breaking your data up into smaller, fixed-size blocks that are compressed individually might be feasible.
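
A rough sketch of that idea, assuming an arbitrary fixed chunk size of 64 KiB; each chunk is deflated independently, so rewriting one chunk never forces re-compression of the others (the class and method names here are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class ChunkedDeflate {
    static final int CHUNK_SIZE = 64 * 1024; // fixed, illustrative

    // Compress one chunk on its own: no dictionary state is shared between
    // chunks, so any single chunk can be replaced without touching the rest.
    static byte[] compressChunk(byte[] chunk) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(chunk);
        }
        return out.toByteArray();
    }

    static byte[] decompressChunk(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream iis =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            iis.transferTo(out); // Java 9+; use a manual buffer loop on older JDKs
        }
        return out.toByteArray();
    }
}
```

Since compressed chunks vary in length, a real implementation would also need an index mapping chunk numbers to byte offsets in the container file.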

The point of compression is to recognize redundancy in data (like some characters occurring more often, or repeated patterns) and make the data smaller by encoding it without that redundancy. This makes it infeasible to create a compression algorithm that would allow random-access writing. In particular:

  • You never know in advance how well a piece of data will compress. So if you change some block of data, its compressed version will most likely become either longer or shorter, and it will no longer fit the space its old compressed form occupied.
  • As a compression algorithm processes the data stream, it uses the knowledge accumulated so far (such as discovered repeated patterns) to compress the data at its current position. So if you change something, the algorithm needs to re-compress everything from that change to the end.

So the only reasonable solution is to manipulate the data first and compress it all at once at the end.
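
If the result does have to be a zip archive, that boils down to one sequential pass over the finished file; a sketch under that assumption (paths and names are placeholders):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipWhenDone {
    // Stream the finished file into a zip archive in one sequential pass.
    // Call this only after the RandomAccessFile has been closed.
    static void zipFinishedFile(String sourcePath, String zipPath) throws IOException {
        try (FileInputStream in = new FileInputStream(sourcePath);
             ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipPath))) {
            zos.putNextEntry(new ZipEntry(sourcePath));
            in.transferTo(zos); // Java 9+; use a manual buffer loop on older JDKs
            zos.closeEntry();
        }
    }
}
```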

Licensed under: CC-BY-SA with attribution