Question

This question on archiving PDFs got me wondering: if I wanted to compress (for archival purposes) lots of files that are essentially small changes made on top of a master template (a letterhead), it seems like huge compression gains could be had with inter-file compression.

Do any of the standard compression/archiving formats support this? AFAIK, all the popular formats focus on compressing each file individually.


Solution

Several formats do inter-file compression.

The oldest example is .tar.gz. A .tar archive does no compression of its own; it simply concatenates all the files, with a header before each one. gzip, in turn, compresses only a single stream. Applying the two in sequence compresses across file boundaries, and the combination is a traditional format in the Unix world. .tar.bz2 is the same, only with bzip2 instead of gzip.
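As a rough illustration of the difference, here is a minimal Python sketch that compares gzipping each file separately with gzipping one tar stream. The 4 KB random "letterhead" and the file names are made-up test data, chosen so that any savings come only from repetition between files, not from the contents themselves.

```python
import gzip
import io
import os
import tarfile

# Hypothetical data: a 4 KB template plus a short per-file change.
# Random bytes do not compress on their own, so any gain comes from reuse.
template = os.urandom(4096)
docs = {f"letter_{i}.bin": template + f"Dear customer {i}".encode()
        for i in range(20)}

# Per-file compression: every document is gzipped independently.
separate_total = sum(len(gzip.compress(data)) for data in docs.values())

# Inter-file compression: tar concatenates, gzip compresses the one stream.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, data in docs.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
solid_total = len(buf.getvalue())

print("gzipped separately:", separate_total, "bytes")
print("single .tar.gz:    ", solid_total, "bytes")
```

This only pays off here because consecutive copies of the template sit well within gzip's window; with a much larger template the two totals would end up close to each other.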

More recent examples are formats with optional "solid" compression (for instance, RAR and 7-Zip), which can internally concatenate all the files before compressing, if enabled by a command-line flag or GUI option.

OTHER TIPS

Take a look at Google's open-vcdiff.

http://code.google.com/p/open-vcdiff/

It is designed for calculating small compressed deltas and implements RFC 3284.

http://www.ietf.org/rfc/rfc3284.txt
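To get a feel for compressing a file as a delta against a known template, here is a small Python sketch. Note that it does not use open-vcdiff or the VCDIFF format; it swaps in zlib's preset-dictionary feature (the `zdict` argument) as a stand-in, which only works while the template fits in DEFLATE's 32 KB window. The helper names and the sample data are hypothetical.

```python
import os
import zlib

# Hypothetical shared template, kept under the 32 KB DEFLATE window.
template = os.urandom(8192)
# A document that is the template with a small insertion in the middle.
document = template[:4000] + b"Dear Ms. Example," + template[4000:]

def compress_against(template: bytes, data: bytes) -> bytes:
    """Compress data, letting matches refer back into the template."""
    comp = zlib.compressobj(level=9, zdict=template)
    return comp.compress(data) + comp.flush()

def decompress_against(template: bytes, blob: bytes) -> bytes:
    """Inverse of compress_against; the same template must be supplied."""
    decomp = zlib.decompressobj(zdict=template)
    return decomp.decompress(blob) + decomp.flush()

plain = zlib.compress(document, 9)
delta = compress_against(template, document)

print("without template:", len(plain), "bytes")
print("against template:", len(delta), "bytes")
```

The template itself has to be stored once somewhere outside the compressed blobs, since decompression is impossible without it.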

Microsoft has an API for doing something similar, sans any semblance of a standard.

In general the algorithms you are looking for are ones based on Bentley/McIlroy:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8470

In particular, these algorithms will be a win if the size of the template is larger than the window size (~32 KB) used by gzip or the block size (100-900 KB) used by bzip2.

They are used by Google internally inside of their Bigtable implementation to store compressed web pages for much the same reason you are seeking them.
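The window-size point can be checked with a quick experiment. The sketch below is not a Bentley/McIlroy implementation; it just contrasts DEFLATE (32 KB window, via Python's `zlib`) with LZMA (whose default dictionary is several megabytes, via Python's `lzma`) on two copies of a 200 KB template. The sizes and the random template are assumptions picked for the demonstration.

```python
import lzma
import os
import zlib

# Hypothetical template, deliberately larger than the 32 KB DEFLATE window.
template = os.urandom(200_000)
two_copies = template + b"small change" + template

# DEFLATE cannot look back 200 KB, so the second copy is stored again.
deflate_size = len(zlib.compress(two_copies, 9))

# LZMA's larger dictionary lets the second copy refer back to the first.
lzma_size = len(lzma.compress(two_copies))

print("input:  ", len(two_copies), "bytes")
print("DEFLATE:", deflate_size, "bytes")
print("LZMA:   ", lzma_size, "bytes")
```

Any compressor whose match distance is shorter than the template behaves like the DEFLATE case here, which is the situation long-range matching is meant to address.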

Since LZ-style compression (which pretty much all of them use) works by referring back to repeated strings seen earlier in the stream, such a scheme as you desire would force you to decompress the archive from the start, potentially all of it, just to get at one file.

If this is acceptable in your situation, it may be simpler to implement a method which just joins your files into one big file before compression.
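A minimal sketch of that join-then-compress approach in Python might look like the following. The container layout (a 4-byte length prefix, then a JSON index and the joined file bodies compressed as one zlib stream) and the `pack`/`extract` names are invented for illustration, not any standard format.

```python
import json
import zlib

def pack(files: dict[str, bytes]) -> bytes:
    """Join all files into one blob with an index, then compress once."""
    index, offset = {}, 0
    for name, data in files.items():
        index[name] = (offset, len(data))
        offset += len(data)
    header = json.dumps(index).encode()
    body = b"".join(files.values())
    # 4-byte big-endian header length, then one compressed stream.
    return len(header).to_bytes(4, "big") + zlib.compress(header + body, 9)

def extract(archive: bytes, name: str) -> bytes:
    """Fetch one file; note the whole archive is inflated to do so."""
    header_len = int.from_bytes(archive[:4], "big")
    raw = zlib.decompress(archive[4:])
    index = json.loads(raw[:header_len])
    offset, length = index[name]
    start = header_len + offset
    return raw[start:start + length]

letterhead = b"ACME Corp, 1 Example Street\n" * 50
files = {
    "a.txt": letterhead + b"Invoice 1\n",
    "b.txt": letterhead + b"Invoice 2\n",
}
archive = pack(files)
print(len(archive), "bytes for", sum(map(len, files.values())), "bytes of input")
print(extract(archive, "b.txt")[-10:])
```

Extracting a single file inflates everything, which is exactly the trade-off described above.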

Licensed under: CC-BY-SA with attribution