I've been trying to understand the gzip algorithm, in the context of HTTP streaming connections (SSE, and the various comet technologies). I tested some alternative representations of data, with these filesizes:
40 csv.txt
63 json-short.txt
80 json-readable.txt
27 rawbin.txt
46 sse.csv.txt
69 sse.json-short.txt
86 sse.json-readable.txt
When compressed with gzip -9v, I get:
csv.txt: 25.0%
json-readable.txt: 16.2%
json-short.txt: 20.6%
rawbin.txt: 18.5%
sse.csv.txt: 25.0%
sse.json-readable.txt: 15.1%
sse.json-short.txt: 18.8%
Those are not very good compression rates, and they were also the reverse of my expectations: the more verbose JSON formats appear to compress worse.
My question is: does the compression get better as more and more data is streamed? Does it dynamically and implicitly learn which bits are scaffolding and which bits are variable data? If it is a learning algorithm, is there a point where it stops learning, or is it theoretically always adapting to the data stream? And if so, is extra weight given to more recent data?
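As a rough way to watch this adaptation happen, here is a sketch using Python's zlib module (the same DEFLATE machinery that gzip and mod_deflate are built on): compress the same SSE line repeatedly, flushing after each one as a streaming server must, and look at how many bytes each line costs on the wire. The line is taken from the test files below; the exact byte counts will vary by zlib version.

```python
import zlib

line = (b'data:{"timestamp":"2013-03-29 06:09:03","symbol":"EUR/USD",'
        b'"bid":1.303,"ask":1.304}\n\n')

# wbits=31 asks for a gzip wrapper with the full 32KB window
c = zlib.compressobj(9, zlib.DEFLATED, 31)
sizes = []
for _ in range(5):
    # Z_SYNC_FLUSH forces everything so far out, as a streaming server does
    chunk = c.compress(line) + c.flush(zlib.Z_SYNC_FLUSH)
    sizes.append(len(chunk))
print(sizes)
```

The first chunk pays the full cost of the line (plus the gzip header); every repeat then compresses to a handful of bytes, because DEFLATE matches against a sliding window of the most recent 32KB of input. It never stops adapting, but it also never remembers anything older than that window, which is one sense in which recent data is weighted more heavily.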
I did a crude test, by cat-ing 25 copies of sse.json-readable.txt into one file. Gzip then gave me a 95.7% compression ratio. But I describe this as crude for two reasons. First, each line of data was identical, whereas in realistic data the numbers and timestamps will differ slightly, and only the scaffolding is the same. Second, gzip is being given a single file: does the gzip algorithm do a pre-scan of the data to learn it, or jump around in the file? If so, those results won't apply to Apache streaming data (as it will have already compressed and sent the first line of data before it sees the second line).
As a secondary question, can I assume time is not a factor? E.g. assuming there is no socket reconnection involved, there might be a 1-second gap between each line of data, or a 60-second gap.
Useful reference on how gzip works: http://www.infinitepartitions.com/art001.html
(By the way, my current understanding is that the compression when streaming will be based solely on an analysis of the first block of data; so I'm wondering if I could get better compression by sending a few lines of dummy data, to give it a chance to learn a better compression?!?)
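For what it's worth, that priming idea can be tested at the zlib level with a preset dictionary (the zdict parameter of zlib.compressobj, which pre-loads the window with expected scaffolding before any real data arrives). The scaffold string below is made up for illustration. Note the caveat: preset dictionaries are a feature of the zlib/raw-deflate formats, and the gzip format Apache sends has no way to signal one to the browser, so this only demonstrates the principle.

```python
import zlib

line = (b'data:{"timestamp":"2013-03-29 06:09:03","symbol":"EUR/USD",'
        b'"bid":1.303,"ask":1.304}\n\n')
# Hypothetical "dummy data": the scaffolding with the variable parts left out
scaffold = b'data:{"timestamp":"","symbol":"EUR/USD","bid":1.,"ask":1.}\n\n'

plain = zlib.compressobj(9)
primed = zlib.compressobj(9, zdict=scaffold)

first = len(plain.compress(line) + plain.flush(zlib.Z_SYNC_FLUSH))
first_primed = len(primed.compress(line) + primed.flush(zlib.Z_SYNC_FLUSH))
print(first, first_primed)
```

The primed stream sends the first line in fewer bytes because it can already back-reference the scaffolding; the decompressor must be given the identical dictionary, which is exactly what a plain gzip response cannot arrange.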
UPDATE: USEFUL LINKS
Here is the Apache module code:
http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/filters/mod_deflate.c
The window size of 15 is what gives the 32KB window that Mark Adler mentions in his answer.
Here are some pages that can help understand the Apache code:
http://www.zlib.net/zlib_how.html
http://www.zlib.net/zlib_tech.html
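The 32KB window can be demonstrated directly: a repeated chunk of data is only matched (and so compressed away) while the earlier copy still sits inside the last 32KB of input. A rough sketch with Python's zlib, using random filler so the chunk can't be compressed any other way:

```python
import os
import zlib

block = os.urandom(200)  # incompressible except by matching an earlier copy

def cost_after(filler):
    """Bytes needed to re-send block after `filler` bytes have gone past."""
    c = zlib.compressobj(9)
    c.compress(block)            # first copy enters the window
    c.compress(filler)           # push the first copy back in history
    c.flush(zlib.Z_SYNC_FLUSH)   # drain everything emitted so far
    # Measure only the bytes produced for the second copy
    return len(c.compress(block) + c.flush(zlib.Z_SYNC_FLUSH))

near_cost = cost_after(os.urandom(1000))   # first copy still in the window
far_cost = cost_after(os.urandom(40000))   # first copy pushed out of 32KB
print(near_cost, far_cost)
```

With 1000 bytes of filler the second copy costs only a few bytes (a back-reference); with 40000 bytes of filler the first copy has slid out of the 32KB window and the block has to be sent essentially in full again.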
Here are the above test files, in case you care:
csv.txt
2013-03-29 03:15:24,EUR/USD,1.303,1.304
json-short.txt
{"t":"2013-03-29 06:09:03","s":"EUR\/USD","b":1.303,"a":1.304}
json-readable.txt
{"timestamp":"2013-03-29 06:09:03","symbol":"EUR\/USD","bid":1.303,"ask":1.304}
sse.csv.txt
data:2013-03-29 03:15:24,EUR/USD,1.303,1.304
sse.json-short.txt
data:{"t":"2013-03-29 06:09:03","s":"EUR\/USD","b":1.303,"a":1.304}
sse.json-readable.txt
data:{"timestamp":"2013-03-29 06:09:03","symbol":"EUR\/USD","bid":1.303,"ask":1.304}
NOTE: the sse.* versions end in two LFs, the others end in one LF.
rawbin.txt was made with this PHP script:
// 4-byte long + 7-byte NUL-padded string + two 8-byte doubles = 27 bytes
$s=pack("la7dd",time(),"USD/JPY",1.303,1.304);
file_put_contents("rawbin.txt",$s);
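For comparison, the equivalent packing in Python's struct module (little-endian assumed here; PHP's "l" uses machine byte order) confirms where the 27 bytes of rawbin.txt come from:

```python
import struct
import time

# "<l" 4-byte int + "7s" 7-byte string + two "d" 8-byte doubles = 27 bytes
s = struct.pack("<l7sdd", int(time.time()), b"USD/JPY", 1.303, 1.304)
print(len(s))  # 27
```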