Theory: Compression algorithm that makes some files smaller but none bigger?

https://stackoverflow.com/questions/1513567

19-09-2019
|

Question

I came across this question;

"A lossless compression algorithm claims to guarantee to make some files smaller and no files larger.
Is this;

a) Impossible

b) Possible but may run for an indeterminate amount of time,

c) Possible for compression factor 2 or less,

d) Possible for any compression factor?"

I'm leaning towards (a), but couldn't give a solid explanation as to why. (I'll list the thoughts a friend and I came up with as a possible answer)

Solution

By the pigeon-hole principle, given a string of 10 bits you have 1024 possible inputs, and need to map to 9 bits or fewer, so there are < 1024 outputs.

This guarantees either the algorithm has collisions (lossy compression) or at some point choses to return the unmodified input as output.

In the latter case, you cannot determine how to decompress an arbitrary string of bits. (It could be an unmodified input, or a compressed output from a larger bit string).

-> Impossible.

OTHER TIPS

Just a slight clarification of RJFalconer's post...

You only have to have some files becoming smaller, so the claim that a string of 10 bits has to map to 9 bits or fewer isn't quite right. In particular, if someone proposed such a compression mechanism it could map all strings of 10 bits or fewer to exactly the same output (i.e. an identity transformation).

However, we are told that there is at least one file which does become smaller. Without loss of generality, consider that to start with x bits and end up as y bits, where y is strictly less than x.

Now consider the domain of "files with y bits or fewer", which has 2^y+1-1 bit strings (including the empty one). In order for none of those to result in a bigger file, each has to map to a bit string in the same domain, i.e. 2^y+1-1 compressed files. However, we already know that the initial string of length x bits compresses to one of those values - leaving only 2^y+1-2 possible values.

At this point the pigeon hole principle comes in - you clearly can't map 2^y+1-1 inputs to 2^y+1-2 outputs without repeating an output, which violates the reversibility of compression.

a) impossible

If you have a file that can not be compressed further, you still have to add the information whether it has been compressed or not, so in that case the file would have to grow.

I know that I'm kinda late, but I found this via google and someone else could do the same, so I'll post my answer: the obvious solution is a) impossible, as well pointed out by Jon Skeet (and btw there are a lot of proofs all over the internet). I'm not questioning the impossibility to compress random data, just to be clear from the beginning; I understood the theory that lays behind it, and -if you ask me- I trust the math. : D

But, if we're allowed to think laterally, we could definitely take advantage of the fact that the question is not well-defined, meaning that it does not give a strict definition of "compression algorithm" and of the properties that it should have (but to reduce some files without expanding anyone else).

Also, it doesn't put whatsoever condition on the files to be compressed, the only thing it's interested in is "to make some files smaller and no files larger".

That said, we have now at least two ways to show that, in fact, it does exist such an algorithm:

We can exploit the name of the file to store some of the information of the file (or even the entire file, if the file system allows it, thus reducing every file to 0 bit). Trivially, we could simply decide leave untouched every file but one, reducing it to 0 bit and renaming it with a predefined name. I agree that this could be considered cheating, but then again, there are no restrictions in the initial question and this algorithm would effectively achieve the purpose (as long as no one renames the file, that's why this would be a very poor design choice besides being pointless).
We can limit the number of files to be compressed, say, to the ones at least X bits long. Once again, a trivial solution would be to leave every file untouched but one, that we can reduce making it match to a file smaller than X bits. Now we do have an algorithm which, quoting verbatim, makes some files smaller and no files larger; however, it performs a restriction on all its possible inputs (i.e. it cannot process all the files).

To those who argue that this wouldn't have any practical use, I say that I agree with you... but hey, this is theory, and this was just a theoretical dissertation. ;)

Obviously, if I were to do a test and face this question, I'd put a bold X on the a), and then just go on without thinking too much about it.

Nevertheless, it is perfectly possible to show that, since natural language is intrinsically ambiguous and the question is not formally expressed, each of the other possible answers is not necessarily wrong: placing the right conditions and eventually specifying more clearly what is meant by certain concepts, we may legally be able to meet the goal of any of the other listed options, doing some sort of trickery and forcing the program to achieve the desired behavior.

e) Possible

...with some restrictions.

I recently came across Shoco, a string compression library for small strings. I was reminded of this question when reading this claim:

...the most remarkable property of shoco is that the compressed size will never exceed the size of your input string, provided it is plain ASCII.

If you are sure that the input data is plain ASCII, your out buffer for only needs to be as large as the input string

http://ed-von-schleck.github.io/shoco/#how-it-works

possible

to make some files smaller and no files larger

if said compression algorithm makes the file larger, just have it return the original file.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow