質問

Or more accurately stated, when two identical strings are concatenated to each other, why can't zlib deflate the entire second string? It seems that when a matching string starts immediately after the previous instance of the same string, zlib emits the first character as a string literal and then emits a backwards reference to the previous string minus the first character.

For example, if I use zlib to deflate the string latelate, the output is 5 string literals followed by a back reference...

l a t e l <len=3, dist=4>

or huffman encoded...

0000000 cb 49 2c 49 cd 01 62 00
0000010

where I've simplified the output by using a "raw" deflate stream (i.e. windowBits = -15) and the fixed huffman encoding (i.e. the compression strategy is Z_FIXED).

Why must zlib emit the second literal character 'l' before using a back reference to "ate"?

In other words, why can't it output...?

l a t e <len=4, dist=4>

I tried forcing the second version with my own deflate implementation, but zlib won't inflate the output. I get the error "invalid or incomplete deflate data".

役に立ちましたか?

解決

Let's separate DEFLATE, as a compression bitstream format described in https://www.ietf.org/rfc/rfc1951.txt from zlib, which is implementation of algorithms to encode and decode such bitstream.

Then, DEFLATE certainly can represent, and compress, concatenation of 2 strings. Why zlib doesn't do that? Well, because match searching for LZ77 compression is inherently heuristic task, so some choices won't be explored, even those which seem obvious to a human.

Using a trivial hash-based LZ77 encoder, double-string case is easily found:

L6c # l
L61 # a
L74 # t
L65 # e
C-4,4

And this sequence can be encoded with static zlib encoding without problem, result is:

CB 49 2C 49 05 61 00

This bitstream also can be decoded without problem by zlib. You can try that using Python:

import zlib
import binascii
zlib.decompress(binascii.unhexlify("CB492C49056100"), -15)

So, what version of zlib did you use? Maybe it was too old?

ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top