Question

I have a huge text file, about 500 MB in size. I tried to compress it with gzip, both from a Python program and from the command line. In both cases the compressed file is about 240 MB, whereas WinRAR on Windows produces an archive of around 450 KB. Is there something I am missing here? Why is the difference so large, and what can I do to achieve a similar level of compression?

I have also tagged this with Python, since any Python code for this would be very helpful.

Here are the first 3 lines of the file:

$ head 100.txt -n 3
31731610:22783120;
22783120:
45476057:39683372;5879272;54702019;58780534;30705698;60087296;98422023;55173626;5607459;843581;11846946;97676518;46819398;60044103;48496022;35228829;6594795;43867901;66416757;81235384;42557439;40435884;60586505;65993069;76377254;82877796;94397118;39141041;2725176;56097923;4290013;26546278;18501064;27470542;60289066;43986553;67745714;16358528;63833235;92738288;77291467;54053846;93392935;10376621;15432256;96550938;25648200;10411060;3053129;54530514;97316324;

Solution

It is possible that the file is highly redundant, with a repeating pattern that is longer than 32K. gzip's deflate only looks 32K back for matches, whereas other compressors, such as RAR, can capitalize on history much further back.
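Since the question asks for Python code, here is a minimal sketch that uses the standard-library lzma module, whose xz format searches a much larger window than deflate. The function name compress_file_xz and the file names in the comment are only illustrative, and whether this gets close to WinRAR's result on this particular file is an assumption that would have to be checked on the real data.

```python
import lzma
import shutil


def compress_file_xz(src_path, dst_path, preset=6):
    """Stream src_path into an .xz file at dst_path.

    The file is copied in chunks, so a 500 MB input is never held
    in memory all at once. `preset` ranges from 0 to 9; higher
    presets use larger dictionaries and need more memory.
    """
    with open(src_path, "rb") as src, lzma.open(dst_path, "wb", preset=preset) as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)


# Hypothetical usage, with the file name taken from the question:
# compress_file_xz("100.txt", "100.txt.xz")
```

The result can be read back with lzma.open(dst_path, "rb"), or unpacked with the xz command-line tool.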

Update:

I just made a file that is a 64K block of random data, repeated 4096 times (256 MB). gzip, with its 32K window, was blind to the redundancy and so unable to compress it; it actually expanded the file to 256.04 MB. xz (LZMA with an 8 MB window) compressed it to 102 KB.
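The experiment can be reproduced at a smaller scale from Python. This is a rough sketch, not the exact commands used above: it builds the data in memory, uses only 64 repeats (4 MB) so it runs quickly, and the function name compare_compression is made up for illustration.

```python
import gzip
import lzma
import os


def compare_compression(block_size=64 * 1024, repeats=64):
    """Compress a random block repeated `repeats` times with gzip and xz.

    Returns (raw_size, gzip_size, xz_size) in bytes.
    """
    block = os.urandom(block_size)   # incompressible on its own
    data = block * repeats           # redundant only at a 64K distance
    gz_size = len(gzip.compress(data, compresslevel=9))
    xz_size = len(lzma.compress(data, preset=6))
    return len(data), gz_size, xz_size


if __name__ == "__main__":
    raw, gz, xz = compare_compression()
    print(f"raw:  {raw} bytes")
    print(f"gzip: {gz} bytes")
    print(f"xz:   {xz} bytes")
```

Because every repeat sits 64K behind its previous copy, deflate never finds a match, while xz can refer back to the first block for everything after it.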

OTHER TIPS

WinRAR and Gzip are two very different compression programs. They each use different algorithms to compress data. Here are the descriptions of each type from Wikipedia:

Version 3 of RAR is based on Lempel-Ziv (LZSS) and prediction by partial matching (PPM) compression, specifically the PPMd implementation of PPMII by Dmitry Shkarin.

http://en.wikipedia.org/wiki/RAR#Compression_algorithm

And Gzip:

It is based on the DEFLATE algorithm, which is a combination of Lempel-Ziv (LZ77) and Huffman coding.

en.wikipedia.org/wiki/Gzip

My guess is that the difference comes from how prediction by partial matching and Huffman coding work. That file has very interesting properties though... What is the file?

Licensed under: CC-BY-SA with attribution