Question

I'll phrase my question using an intuitive and rather extreme example:

Is the expected compression ratio (using zip compression) of a children's book higher than that of a novel written for adults?

I read somewhere that the compression ratio for zip compression in particular can be considered an indicator of the information (as interpreted by a human being) contained in a text. I can't find that article anymore, though.

I am not sure how to attack this question. Of course no compression algorithm can grasp the meaning of verbal content. So what would a zip compression ratio reflect when applied to a text? Is it just symbol patterns (e.g. word repetitions leading to a higher ratio), so that it basically just reflects the vocabulary?

Update:

Another way to put my question would be whether there is a correlation which goes beyond repetition of words / restricted vocabulary.


Tangentially related:

Relation of Word Order and Compression Ratio and Degree of Structure


Solution

Shannon's noiseless coding theorem is the formal statement that the size of an optimally compressed data stream is equal to the amount of information in that data stream.

But you are also right that the definition of "amount of information" depends on your probability model for the data stream. If you have a correct table of letter probabilities before you start the compression, then Huffman coding gets within one bit per symbol of the entropy, and arithmetic coding gets even closer to optimal. But if there are correlations between neighboring symbols (which there are in real human-produced text), you can do better by choosing codes for pairs of letters, better still by looking at triples, and so on. Additionally, you typically don't have a very good probabilistic model of your data stream when you start, so you need to adaptively construct the probability table and then assign variable-length codes depending on the probabilities as you learn more.
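
To make the "pairs do better" point concrete, here is a minimal Python sketch (book.txt is a placeholder file name, not anything from the question) that estimates the bits per symbol needed by a letter-by-letter code versus a code that conditions each letter on its predecessor; on real text the second number is noticeably lower, precisely because of the neighbor correlations mentioned above.

    import math
    from collections import Counter

    def order0_entropy(text: str) -> float:
        """Bits per symbol if each letter is coded independently from its frequency."""
        counts = Counter(text)
        n = len(text)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def order1_entropy(text: str) -> float:
        """Bits per symbol when each letter is coded conditioned on the previous one,
        which captures the neighbor correlations a pair-based code would exploit."""
        pair_counts = Counter(zip(text, text[1:]))
        ctx_counts = Counter(text[:-1])
        n = len(text) - 1
        return -sum(c / n * math.log2(c / ctx_counts[a])
                    for (a, b), c in pair_counts.items())

    sample = open("book.txt", encoding="utf-8").read()   # placeholder input file
    print(order0_entropy(sample), order1_entropy(sample))  # the second is usually lower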

The kinds of compression used in zip/gzip, compress, and 7-zip are all variants of Lempel-Ziv coding. Lempel-Ziv coding is adaptive, builds probability tables over varying-length chunks of letters, and is optimal in the sense that, given an infinite random stream that is stationary and ergodic (the statistics are stable over time), it will in the limit adaptively find a table that takes correlations over arbitrary distances into account, and then use that table to produce an output whose length approaches the optimal coding.
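
As a rough empirical illustration of this adaptivity, the following sketch (again with book.txt as a placeholder for any long text file) compresses growing prefixes of a file with zlib, which uses DEFLATE as in zip/gzip, and prints the compression ratio. The ratio typically improves as the coder sees more data, although DEFLATE's 32 KB window keeps it from exploiting correlations over the arbitrary distances the limit theorem allows.

    import zlib

    def ratio(data: bytes) -> float:
        """Compressed size divided by original size (smaller is better)."""
        return len(zlib.compress(data, 9)) / len(data)

    text = open("book.txt", "rb").read()        # placeholder input file
    for n in (1_000, 10_000, 100_000, len(text)):
        chunk = text[:n]
        print(f"first {len(chunk):>8} bytes -> ratio {ratio(chunk):.3f}")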

Of course, neither children's books nor novels for adults are infinitely long ergodic random processes (although some of them may seem like they are), so the conditions of the theorems don't hold.

Here's an interesting interview with Jacob Ziv where he talks a little about the universality of Lempel-Ziv coding: http://www.ieeeghn.org/wiki/index.php/Oral-History:Jacob_Ziv#Lempel-Ziv_algorithm.2C_Viterbi_algorithm.2C_and_Ziv-Zakai_bound.

OTHER TIPS

I'd say this sounds highly likely. Suppose you take a fairly large sample of children's literature and a similarly-sized (in characters) sample of adult literature. It seems entirely reasonable to suspect that there is a greater variety of words in the adult literature, and that these words may be longer or may rely on more unusual diphthongs than words used in children's literature. This may further imply that children's literature has more whitespace and possibly more punctuation than adult literature. Taken together, this points towards children's literature being much more homogeneous at the scale of 5-10 characters than adult literature, so compression techniques that can take advantage of homogeneity at the scale of up to at least a dozen or so characters should be able to compress children's text more efficiently than adult literature.

Of course, this makes some assumptions about what is considered children's literature and what is considered adult literature. What do you consider "Gulliver's Travels", for instance? My discussion above assumes that we're considering books that are clearly for young children and books that are clearly for adults; compare "Goodnight Moon" to "1984", for instance.
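
If you want to test this hypothesis directly, a quick sketch along these lines might help; the file names are placeholders for equally sized samples of each corpus, the type/token ratio is a crude proxy for "variety of words", and zlib stands in for zip-style compression.

    import re
    import zlib

    def sample_stats(path):
        """Return (type/token ratio, zlib compression ratio) for a text file."""
        text = open(path, encoding="utf-8").read()
        words = re.findall(r"[a-z]+", text.lower())
        type_token = len(set(words)) / len(words)
        raw = text.encode("utf-8")
        comp_ratio = len(zlib.compress(raw, 9)) / len(raw)
        return type_token, comp_ratio

    # placeholder file names; use samples of equal length in characters
    for path in ("childrens_sample.txt", "adult_sample.txt"):
        tt, cr = sample_stats(path)
        print(f"{path}: type/token {tt:.3f}, compression ratio {cr:.3f}")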

I do not know of a direct connection between compression ratio and level of writing. It seems logical, though.

There is a nice suggestion to use compression to find the relative distance between texts (or DNA sequences, or languages): Cilibrasi, R. and Vitányi, P. M. B., "Clustering by Compression", IEEE Transactions on Information Theory, 51(4), 2005, pp. 1523–1545.

"we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD , computed from the lengths of compressed data files (singly and in pairwise concatenation)."

Each letter corresponds to a binary representation (based on its ASCII value). In your example, the larger amount of data would result in a larger compressed file for the adult book than for the children's book, but the density of the data being compressed is equivalent in binary, as you can see by comparing, say, 0110 and 0010. The compression is not affected by the interpretation of the data, but by the fact that the strings themselves contain more characters and therefore more ASCII values, which in turn determines the number of octets and therefore the size of the compressed file.

A frequency distribution I developed in algorithmic form several years ago is expressed as a Bayesian inference engine: it estimates the probability of a feature-detected letter given the range of letters and the values attributed to them.

I gave a lecture on this at Augustana's annual symposium in 2010:

http://youtu.be/BPdQjBSW0_w

The video explains how a natural-language parser can operate with feature detection in the context of pairing associated letters, which are recognized by assigning numerical values to the features of each letter.

The analysis of my data shows that yes, the relationship between associated letters in a word, phrase, and sentence can be formalized and implemented in a system based on decomposing a specific letter into its elements.

The correlation is based on the lexicon programmed into the system, coupled with the decomposition of a letter into its elements, such that:

A diagonal line has a value of 0.5 per diagonal, a horizontal line has a value of 1, a vertical line has a value of 1.5, a semi-circle has a value of 2, and a full circle has a value of 3. These values are then added and compared to the results of the associated letters in the word, entered into a Bayesian inference engine, and finally compared against a given threshold to allow for the accept state.
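
As I understand the description, a toy version of that scoring step might look like the sketch below. The per-letter decompositions are my own guesses for illustration (the answer does not list them), and the Bayesian comparison step is only mentioned in a comment, not implemented.

    # Stroke values as described above; the per-letter decompositions below are
    # hypothetical examples, not the author's actual table.
    STROKE_VALUES = {"diagonal": 0.5, "horizontal": 1.0, "vertical": 1.5,
                     "semicircle": 2.0, "circle": 3.0}

    LETTER_STROKES = {                      # illustrative guesses only
        "A": ["diagonal", "diagonal", "horizontal"],
        "T": ["vertical", "horizontal"],
        "O": ["circle"],
        "B": ["vertical", "semicircle", "semicircle"],
    }

    def letter_value(ch: str) -> float:
        return sum(STROKE_VALUES[s] for s in LETTER_STROKES.get(ch.upper(), []))

    def word_value(word: str) -> float:
        # The summed value would then be fed to the Bayesian inference engine
        # and compared against a threshold for the accept state (not shown here).
        return sum(letter_value(ch) for ch in word)

    print(word_value("BOAT"))   # 5.5 + 3.0 + 2.0 + 2.5 = 13.0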

Well, I don't think anyone has studied this particular problem, and an answer is unlikely to be reachable on purely theoretical grounds... However, some things can be said:

  • Odds are very good that the longer text will compress better [in terms of compression ratio, not in terms of absolute length, of course] no matter whether it is adult or children's literature. (A point already mentioned by Peter Shor in a comment above.)
  • Even the best LZ-based text compression algorithms (which use actual words, unlike run-of-the-mill zip) get beaten by stock bzip2 on text compression ratio; bzip2 does something seemingly "very dumb" [actually very smart] in that it [reversibly] scrambles the text into gibberish strings, but with many repeated letters; this is the Burrows-Wheeler transform (a minimal sketch of it follows this list). I do not expect the frequency of words to matter much for bzip2-style compression, but the letter frequencies probably would; it is much harder to guesstimate whether child and adult literature differ in that respect, but I suspect they don't differ substantially on letter frequencies.
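
Here is the minimal sketch mentioned in the second bullet: a naive Burrows-Wheeler transform in Python. Real implementations use suffix arrays rather than materializing all rotations, but the output illustrates how letters with similar right-hand context end up clustered into runs that the later stages of bzip2 can exploit.

    def bwt(s: str) -> str:
        """Naive Burrows-Wheeler transform: append a sentinel, sort all rotations,
        and return the last column. For illustration only; O(n^2 log n)."""
        s += "\x03"                                   # sentinel assumed absent from s
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    print(repr(bwt("banana")))   # 'annb\x03aa' -- note the clustered letters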

Both points can be seen [on general texts, not child vs. adult] in tables 2-3 (p. 43) of the paper "Text Compression: Syllables" by Jan Lansky and Michal Zemlicka (sorry for leaving out the diacritics in their names). Mind you, this is for English and Czech text [beware that table 3 has an incorrect caption: "bytes" should be "bits", same as in table 2], and you can see that bzip2 does less well on the latter, almost surely because there are more symbols in the base alphabet for Czech compared to English. So for (say) Chinese (using a native encoding, not via Romanization), I would not bet on bzip2 anymore...

Strangely enough, Salomon's Handbook of Data Compression [5th ed., 2010], where I found the reference to the aforementioned paper and a fairly ample [re]presentation on pp. 1122-1127, doesn't mention the sequel [which was promised in the conclusion section of Lansky and Zemlicka]: Syllable-Based Burrows-Wheeler Transform (2007). Alas, the results are underwhelming: you need fairly large documents (>200-500 kB) for the syllable-based BWT to overtake the letter-based BWT, and even then the improvements are fairly small. The word-based BWT, also tested there, never wins, although the maximum document size was 5 MB. I think this shows that word repetition is not the major factor in the text compression ratio (if done right) compared to letter and syllable repetition (frequencies), at least for the usual book-sized documents. So I think this answers the question negatively (as in: probably no difference) for same-size documents that differ only somewhat in word repetition (as child vs. adult literature would), when using stock bzip2 or the experimental syllable-based BWT. Perhaps if you compress something much more vast, like the whole of Wikipedia (or, to particularize to this case, two vast collections of literature, one for adults and one for children), then word repetition might be the key factor in getting good compression (and you would want a word-based BWT).

A couple more twists to this:

  • The research cited above assumes a single choice of syllable/word/letter as "predictor". Context mixing (CM) algorithms (pioneered by PAQ) actually use several (adaptively weighted) predictors; for example, PAQ uses both a previous-word and a bit-level PPM predictor (among others). The previous word may seem rather unlikely to help much, but on a 1 GB sample of Wikipedia it actually works quite well, because there are some bot-generated articles with templated text, like those based on US census data. (See the 3rd image, "repeated string analysis", and the commentary in http://mattmahoney.net/dc/dce.html#Section_22. Finding all of them requires a large amount of RAM though; the best-performing derivative of PAQ [called cmix] on that dataset uses ~30 GB of RAM.) I would expect a CM algorithm [which uses previous words as one of the predictors] to do better on texts with smaller vocabularies (like children's stories), all other things being equal.

  • If you're willing to apply text-only transforms (which BWT is not), then you can do better than BWT; see the (somewhat improperly named) "word replacing transformation" (WRT) paper, which actually consists of a whole bunch of little tweaks that make words more likely to match each other, like case conversion, replacing [in a reversible manner] end-of-lines with spaces, inserting spaces before punctuation, etc. It doesn't seem like any of that would favor smaller vocabularies, but there are a lot of techniques listed in there and I haven't read that paper thoroughly. (A toy illustration of the case-conversion idea follows this list.)
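
As one concrete, simplified example of the case-conversion tweak: the sketch below reversibly lowercases a text while recording capitalization with an escape byte, so that "The" and "the" become identical strings for the downstream compressor to match. This is my own toy illustration, not the scheme from the WRT paper.

    CAP = "\x01"   # escape byte marking "the next letter was uppercase";
                   # assumed not to occur in the input text

    def encode_case(text: str) -> str:
        return "".join(CAP + ch.lower() if ch.isupper() else ch for ch in text)

    def decode_case(text: str) -> str:
        out, i = [], 0
        while i < len(text):
            if text[i] == CAP:
                out.append(text[i + 1].upper())
                i += 2
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    original = "The cat saw the other cat."
    encoded = encode_case(original)
    assert decode_case(encoded) == original   # the transform is reversible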

In a way combining the two previous points, one can actually use a word-based (word as in natural-language word) compression as a pre-processing step for a general-purpose (including LZ-based) compressor. I'm a bit surprised this hadn't been tested much before the 2009 paper by Farina et al. This idea does pretty well when the base compressor is gzip, but the compression-ratio improvements for bzip2, 7zip, or PPMd as backends are rather modest. Perhaps the most notable thing about this precompression is that it increases overall compression speed. Percentage-wise, this was most notable for 7zip, which saw a 3x-4x compression-speed improvement! 7zip is LZMA-based, so roughly speaking LZ on steroids: a very large dictionary, and range coding instead of the Huffman coding used in gzip/DEFLATE. (Tables 6-8 have the gist of the results.)
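
A throwaway way to get a feel for this idea (not the scheme from the Farina et al. paper, and not reversible, since it skips the escaping a real implementation needs): replace the most frequent words with single spare bytes and compare the zlib output sizes. The input file name is a placeholder.

    import re
    import zlib
    from collections import Counter

    def toy_word_precompress(text: str) -> bytes:
        """Map the 50 most frequent words to single bytes in the 0x80+ range.
        One-way toy: a real scheme would escape those bytes to stay reversible."""
        words = re.findall(r"[A-Za-z]+", text)
        frequent = [w for w, _ in Counter(words).most_common(50)]
        for i, w in enumerate(frequent):
            text = re.sub(rf"\b{w}\b", chr(0x80 + i), text)
        return text.encode("latin-1", errors="replace")

    raw = open("book.txt", encoding="utf-8").read()     # placeholder file name
    plain = zlib.compress(raw.encode("utf-8"), 9)
    pre = zlib.compress(toy_word_precompress(raw), 9)
    print(len(plain), len(pre))                         # compare the two sizes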
