I've been trying to understand the gzip algorithm, in the context of HTTP streaming connections (SSE, and the various comet technologies). I tested some alternative representations of data, with these filesizes:
40 csv.txt
63 json-short.txt
80 json-readable.txt
27 rawbin.txt
46 sse.csv.txt
69 sse.json-short.txt
86 sse.json-readable.txt
When compressed with gzip -9v, I get:
csv.txt: 25.0%
json-readable.txt: 16.2%
json-short.txt: 20.6%
rawbin.txt: 18.5%
sse.csv.txt: 25.0%
sse.json-readable.txt: 15.1%
sse.json-short.txt: 18.8%
Those are not very good compression rates, and they were also the reverse of my expectations: the more verbose JSON formats appear to compress worse.
My question is: does the compression get better as more and more data is streamed? Does it dynamically and implicitly learn which bits are scaffolding and which bits are variable data? If it is a learning algorithm, is there a point where it stops learning, or is it theoretically always adapting to the data stream? And if so, is extra weight given to more recent data?
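As a rough way to watch this adaptation happen, here is a sketch using Python's zlib module (the same DEFLATE machinery that gzip and mod_deflate are built on): compress the same SSE line repeatedly, flushing after each one as a streaming server must, and look at how many bytes each line costs on the wire. The line is taken from the test files below; the exact byte counts will vary by zlib version.

```python
import zlib

line = (b'data:{"timestamp":"2013-03-29 06:09:03","symbol":"EUR/USD",'
        b'"bid":1.303,"ask":1.304}\n\n')

# wbits=31 asks for a gzip wrapper with the full 32KB window
c = zlib.compressobj(9, zlib.DEFLATED, 31)
sizes = []
for _ in range(5):
    # Z_SYNC_FLUSH forces everything so far out, as a streaming server does
    chunk = c.compress(line) + c.flush(zlib.Z_SYNC_FLUSH)
    sizes.append(len(chunk))
print(sizes)
```

The first chunk pays the full cost of the line (plus the gzip header); every repeat then compresses to a handful of bytes, because DEFLATE matches against a sliding window of the most recent 32KB of input. It never stops adapting, but it also never remembers anything older than that window, which is one sense in which recent data is weighted more heavily.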
I did a crude test, by cat-ing 25 copies of sse.json-readable.txt into one file. Gzip then gave me a 95.7% compression ratio. But I describe this as crude for two reasons. First, each line of data was identical, whereas in realistic data the numbers and timestamps will differ slightly, and only the scaffolding is the same. Second, gzip is being given a single file: does the gzip algorithm do a pre-scan of the data to learn it, or jump around in the file? If so, those results won't apply to Apache streaming data (as it will have already compressed and sent the first line of data before it sees the second line).
As a secondary question, can I assume time is not a factor? E.g. assuming there is no socket reconnection involved, there might be a 1-second gap between each line of data, or a 60-second gap.
Useful reference on how gzip works: http://www.infinitepartitions.com/art001.html
(By the way, my current understanding is that the compression when streaming will be based solely on an analysis of the first block of data; so I'm wondering if I could get better compression by sending a few lines of dummy data, to give it a chance to learn a better compression?!?)
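For what it's worth, that priming idea can be tested at the zlib level with a preset dictionary (the zdict parameter of zlib.compressobj, which pre-loads the window with expected scaffolding before any real data arrives). The scaffold string below is made up for illustration. Note the caveat: preset dictionaries are a feature of the zlib/raw-deflate formats, and the gzip format Apache sends has no way to signal one to the browser, so this only demonstrates the principle.

```python
import zlib

line = (b'data:{"timestamp":"2013-03-29 06:09:03","symbol":"EUR/USD",'
        b'"bid":1.303,"ask":1.304}\n\n')
# Hypothetical "dummy data": the scaffolding with the variable parts left out
scaffold = b'data:{"timestamp":"","symbol":"EUR/USD","bid":1.,"ask":1.}\n\n'

plain = zlib.compressobj(9)
primed = zlib.compressobj(9, zdict=scaffold)

first = len(plain.compress(line) + plain.flush(zlib.Z_SYNC_FLUSH))
first_primed = len(primed.compress(line) + primed.flush(zlib.Z_SYNC_FLUSH))
print(first, first_primed)
```

The primed stream sends the first line in fewer bytes because it can already back-reference the scaffolding; the decompressor must be given the identical dictionary, which is exactly what a plain gzip response cannot arrange.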
UPDATE: USEFUL LINKS
Here is the Apache module code:
http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/filters/mod_deflate.c
The window size of 15 is what gives the 32KB window that Mark Adler mentions in his answer.
Here are some pages that can help understand the Apache code:
http://www.zlib.net/zlib_how.html
http://www.zlib.net/zlib_tech.html
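The 32KB window can be demonstrated directly: a repeated chunk of data is only matched (and so compressed away) while the earlier copy still sits inside the last 32KB of input. A rough sketch with Python's zlib, using random filler so the chunk can't be compressed any other way:

```python
import os
import zlib

block = os.urandom(200)  # incompressible except by matching an earlier copy

def cost_after(filler):
    """Bytes needed to re-send block after `filler` bytes have gone past."""
    c = zlib.compressobj(9)
    c.compress(block)            # first copy enters the window
    c.compress(filler)           # push the first copy back in history
    c.flush(zlib.Z_SYNC_FLUSH)   # drain everything emitted so far
    # Measure only the bytes produced for the second copy
    return len(c.compress(block) + c.flush(zlib.Z_SYNC_FLUSH))

near_cost = cost_after(os.urandom(1000))   # first copy still in the window
far_cost = cost_after(os.urandom(40000))   # first copy pushed out of 32KB
print(near_cost, far_cost)
```

With 1000 bytes of filler the second copy costs only a few bytes (a back-reference); with 40000 bytes of filler the first copy has slid out of the 32KB window and the block has to be sent essentially in full again.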
Here are the above test files, in case you care:
csv.txt
2013-03-29 03:15:24,EUR/USD,1.303,1.304
json-short.txt
{"t":"2013-03-29 06:09:03","s":"EUR\/USD","b":1.303,"a":1.304}
json-readable.txt
{"timestamp":"2013-03-29 06:09:03","symbol":"EUR\/USD","bid":1.303,"ask":1.304}
sse.csv.txt
data:2013-03-29 03:15:24,EUR/USD,1.303,1.304
sse.json-short.txt
data:{"t":"2013-03-29 06:09:03","s":"EUR\/USD","b":1.303,"a":1.304}
sse.json-readable.txt
data:{"timestamp":"2013-03-29 06:09:03","symbol":"EUR\/USD","bid":1.303,"ask":1.304}
NOTE: the sse.* versions end in two LFs, the others end in one LF.
rawbin.txt was made with this PHP script:
// 4-byte long + 7-byte NUL-padded string + two 8-byte doubles = 27 bytes
$s=pack("la7dd",time(),"USD/JPY",1.303,1.304);
file_put_contents("rawbin.txt",$s);
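For comparison, the equivalent packing in Python's struct module (little-endian assumed here; PHP's "l" uses machine byte order) confirms where the 27 bytes of rawbin.txt come from:

```python
import struct
import time

# "<l" 4-byte int + "7s" 7-byte string + two "d" 8-byte doubles = 27 bytes
s = struct.pack("<l7sdd", int(time.time()), b"USD/JPY", 1.303, 1.304)
print(len(s))  # 27
```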