Question

I was wondering which of the following scenarios achieves the highest ratio when a lossless compression algorithm is applied to binary data containing repeated patterns.

Am I correct to assume the compression ratio depends on the pattern's:

  1. Size
  2. Number of times it is repeated

For example the binary data:

10 10 10 10 10 10 10 10 pattern (10) size 2, pattern (10) repeated 8

1001 1001 1001 1001 pattern (1001) size 4, pattern (1001) repeated 4

00000000 11111111 pattern (0) size 1, pattern (0) repeated 8; pattern (1) size 1, pattern (1) repeated 8; or 00000000 11111111 pattern (00000000) size 8, pattern (00000000) repeated 1; pattern (11111111) size 8, pattern (11111111) repeated 1;

Which of the above achieves the highest and lowest compression ratios?

Thanks in advance.


Solution

Those are all sequences that would be very unlikely to be seen in the wild. What is the point of the question?

Run-of-the-mill compressors are byte-oriented. As such, any pattern that results in simply the same byte repeated will give the highest compression ratio. E.g. 1032:1 in the limit for deflate. Other simple repetitions of short patterns will get very high compression ratios. E.g. again 1032:1 for deflate for patterns of two or three repeating bytes.

The limit on compression in these absurdly extreme cases is a function of the compression format, not of the data.
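As a quick sanity check, the point above can be demonstrated with Python's `zlib` module (a deflate implementation): a megabyte of any short repeating byte pattern compresses to roughly the same tiny size, so the ratio is governed by the format's overhead, not by which pattern you picked. The specific patterns below are illustrative choices, not taken from the question.

```python
import zlib

def ratio(pattern: bytes, total: int = 1 << 20) -> float:
    """Compression ratio of `total` bytes built by repeating `pattern`."""
    data = pattern * (total // len(pattern))
    compressed = zlib.compress(data, 9)  # level 9 = strongest standard setting
    return len(data) / len(compressed)

# One-, two-, and four-byte repeating patterns all compress to
# nearly nothing; their ratios differ only marginally.
for name, pat in [
    ("repeated byte",     b"\x00"),
    ("two-byte pattern",  b"\xa5\x5a"),
    ("four-byte pattern", b"\xde\xad\xbe\xef"),
]:
    print(f"{name}: {ratio(pat):.0f}:1")
```

Note that `zlib.compress` adds a small zlib header and checksum on top of the raw deflate stream, so the measured ratios land a bit below the theoretical 1032:1 deflate limit mentioned above.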

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow