Question

I'm a high school student interested in topics of computer programming.
Recently I became interested in file compression, and in my head I tried to combine this with a completely different part of programming I heard about a while ago, hash values.

From my understanding the hash value of a file is created by an algorithm which performs a one-way secure function on the contents of the file, to output a code or tag of a given length.
Would it be possible to create a reversible, non-secure hash value generator, which would create a value for a file to be compressed, along with hash values for simple variations of the file (such as missing a few characters), and then feed those values along with the given file's length to a program that would, through trial and error, attempt to construct a file that fits the criteria of the varying hash values?

I know that this would be EXTREMELY resource-intensive on a computer to decompress, but would it be possible? And depending on the nature of the value generator, could there be a way to narrow the range that the answer is found in so as to make it easier on the computer?


Solution

Information Theory

Now an N-bit vector can represent 2^N possible states. It is possible, with foreknowledge, to assign each state to a single valid document of any length. In this sense there is no lossiness: all valid documents can be clearly communicated in the smallest amount of space. There is no encoding scheme that can squeeze any more meaning into an N-bit vector. Whatever length you peg that vector at dictates how many valid documents you can get. This is a fundamental property of entropy. I highly recommend reading Claude Shannon's paper on information theory. It will take some time to digest.
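To make the counting concrete, here is a minimal sketch (the codebook contents are made up purely for illustration) of how an N-bit vector with pre-agreed meanings can address exactly 2^N documents and no more:

```python
# A minimal sketch of "compression by foreknowledge": with a pre-agreed
# codebook, an N-bit code can address at most 2**N distinct documents,
# no matter how long each document is. The documents are invented examples.
N = 2
codebook = {
    0b00: "Meeting moved to Tuesday.",
    0b01: "Meeting cancelled.",
    0b10: "Send the quarterly report.",
    0b11: "All clear, no action needed.",
}

assert len(codebook) <= 2 ** N   # an N-bit vector has exactly 2**N states

def transmit(document: str) -> int:
    """Look up the N-bit code for a known document."""
    return next(code for code, doc in codebook.items() if doc == document)

def receive(code: int) -> str:
    """Recover the document exactly -- no loss, but only for the 2**N known ones."""
    return codebook[code]

print(receive(transmit("Meeting cancelled.")))   # -> "Meeting cancelled."
```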

Perfect Compression

But to your question. Can you somehow drop half (more or less) of the bits in the compressed representation and still extract the document? Yes, and No.

Let's say that K bits were dropped from the message. This is the same as saying that there was noise during communication that obscured the state of those bits. Given only the bits that you received, can you discover the original message?

If this message was perfectly compressed, those missing bits introduced (2^K)-1 other valid documents. These documents are equally valid, and equally expected, as the intended document. The receiver can only generate each possibility and say "it is one of these".
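As a toy illustration (the bit string and the '_' marker for lost positions are assumptions made up for this example, not any real format), the receiver can do no better than enumerating every completion of the missing bits:

```python
from itertools import product

# A small sketch: if K bits of a perfectly compressed message were lost,
# the receiver can only enumerate every way of filling them in -- each of
# the 2**K completions decodes to some equally plausible document.
received = "10_1_0"          # '_' marks the K lost bit positions (illustrative)
K = received.count("_")

candidates = []
for fill in product("01", repeat=K):
    bits = iter(fill)
    candidates.append("".join(next(bits) if c == "_" else c for c in received))

print(len(candidates))       # 2**K completions: 4 here
print(candidates)            # ['100100', '100110', '101100', '101110']
```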

Now that does not mean that the receiver is completely confused. If all of the possible documents share some property, say the titles are identical, or the timestamps agree on the date, etc., then the receiver can act on that information. On the other hand, the receiver will need to be more cautious about contradictory properties, and about the validity of an action given the uncertainty in the message.

Tolerating Imperfection

It would be nice if the receiver could dismiss all the (2^K)-1 spurious documents and recover the one true message. But for this to happen the compression scheme has to be less than perfect. Essentially we have introduced at least (2^K)-1 invalid states into the compressed output. This is actually the basis for self-healing codes, error-mitigating codes, and error-detecting codes. I'm not going to go into specifics about how these encodings work on a bit-by-bit level, but think of them as levels of protection.

  • An error-detecting code like a hash, or a cyclic redundancy check (CRC), allows the receiver to verify that everything was received as sent, to some level of confidence (a corruption slips through undetected with probability roughly 1/(2^C), where C is the number of bits in the hash/CRC). They may also provide some limited diagnostics as to what went wrong, and, paired with an exploration of the bit space, suggest potentially valid documents (see the sketch after this list).
  • An error-mitigating code like a Gray code is assembled in such a way that an error in some K bits only changes the meaning of the message by at most (+/-)cK values (c being a constant). These are best with continuous values such as frequencies for music, degrees of rotation, or temperature, where being a little off is not a serious issue.
  • An error-correcting code allows the exact valid document to be recovered when anywhere up to K error bits have occurred. Past K bits, though, it devolves into an error-detecting/mitigating code.
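Here is a minimal sketch of the detection and correction levels above, using deliberately simple stand-ins (a CRC-32 tag and a 3x repetition code) rather than anything a production system would choose:

```python
import zlib

# Detection: append a CRC-32; any mismatch flags corruption.
# Correction: repeat every bit 3 times; a single flipped copy is
# outvoted by the other two (a 3x repetition code).

def with_crc(data: bytes) -> bytes:
    """Error *detecting*: sender appends a 4-byte CRC-32 of the payload."""
    return data + zlib.crc32(data).to_bytes(4, "big")

def check_crc(frame: bytes) -> bool:
    """Receiver recomputes the CRC and compares; False means 'something changed'."""
    data, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(data) == crc

def encode_repetition(bits: str) -> str:
    """Error *correcting*: send each bit three times."""
    return "".join(b * 3 for b in bits)

def decode_repetition(bits: str) -> str:
    """Majority vote over each group of three copies corrects one flip per group."""
    groups = [bits[i:i + 3] for i in range(0, len(bits), 3)]
    return "".join("1" if g.count("1") >= 2 else "0" for g in groups)

frame = with_crc(b"hello")
print(check_crc(frame))                    # True: nothing changed
print(check_crc(b"jello" + frame[5:]))     # False: corruption detected

sent = encode_repetition("1011")           # '111000111111'
noisy = sent[:3] + "1" + sent[4:]          # flip one copy of the second bit
print(decode_repetition(noisy))            # '1011' -- recovered exactly
```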

Regeneration with just a Hash

In this case the hash is serving as an error-detecting code. Unfortunately, all N bits of the message have been lost. It is possible to enumerate all bit sequences from n..N bits long and either compute their hash to verify their validity, or construct the enumerator such that it can only produce valid bit sequences algorithmically. The receiver would need to guess the range n..N such that the valid message is generated.
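A minimal sketch of that brute-force regeneration, assuming a tiny alphabet and a short length range purely to keep the search feasible; the exponential blow-up in candidates is exactly why this is impractical for real files:

```python
import hashlib
from itertools import product

# Enumerate every candidate message in a guessed length range, hash it,
# and keep the ones that match the transmitted digest.
def regenerate(target_digest: bytes, alphabet: bytes, min_len: int, max_len: int):
    for length in range(min_len, max_len + 1):
        for combo in product(alphabet, repeat=length):
            candidate = bytes(combo)
            if hashlib.sha256(candidate).digest() == target_digest:
                yield candidate          # every match is an equally valid answer

secret = b"hi"
digest = hashlib.sha256(secret).digest()
print(list(regenerate(digest, b"abcdefghi", 1, 3)))   # [b'hi']
```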

Guessing the n..N bit range might be assisted by encoding the length somehow in the hash, allowing the receiver to prove that messages with fewer than n bits or more than N bits will always be invalid. This will make the hash function much poorer as a hash of the contents, as it will focus more on a document's length. It will allow the generator to target a specific n..N bit range, but at the cost of recognising more of those bit vectors as valid documents.
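As a rough sketch of that trade-off (the 1-byte/3-byte split below is an arbitrary assumption for illustration), part of the tag can be spent on the length, which narrows the search range but leaves fewer bits to distinguish documents of that length:

```python
import hashlib

# Spend one byte of the tag on the length and keep only three bytes of the
# digest. The receiver now knows exactly which length to search, but far
# more documents of that length collide on the remaining bits.
def length_aware_hash(message: bytes) -> bytes:
    length_part = (len(message) % 256).to_bytes(1, "big")
    content_part = hashlib.sha256(message).digest()[:3]
    return length_part + content_part

tag = length_aware_hash(b"hello world")
print(tag[0])   # 11 -- the generator only needs to try 11-byte candidates
```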

Notice, though, that for this to work as a true compression algorithm, the hash could never have a collision with another valid document. That means there is one unique hash state for each valid document, which makes the hash the compressed document itself, just a very hard one to decode. Unfortunately, a hash without collisions is hard to engineer, and usually requires more bits than the original message, making it more of a diffusion technique than a compression technique. Conversely, if collisions are allowed, the hash devolves back toward error mitigating or error checking, depending on how close the valid documents are in terms of meaning.

So Yes, you can send a hash through and the receiver can extract the original message, at great expense of effort.

But No, the receiver will also receive a lot of other unintended messages too.

Other Tips

A hash generator is nothing but a function f which maps some integer domain X (the possible file contents, interpreted as integer numbers) to some integer codomain Y (the possible codes or tags, also interpreted as integer numbers). And if Y contains only numbers of a given length (as you wrote in your question), but X contains files of arbitrary length, then X has a bigger cardinality than Y, which means f cannot be injective, which means it cannot be reversible.
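A small sketch of that pigeonhole argument, with a deliberately tiny stand-in hash (sum of bytes mod 16, an assumption for illustration only) so the inevitable collision shows up immediately:

```python
# Map a domain X that is larger than the codomain Y through any function f,
# and two inputs must share an output, so f cannot be reversible.
def f(file_contents: bytes) -> int:
    return sum(file_contents) % 16           # codomain Y = {0, ..., 15}

X = [bytes([a, b]) for a in range(256) for b in range(256)]   # 65536 inputs
seen = {}
for x in X:
    y = f(x)
    if y in seen:
        print(f"collision: f({seen[y]!r}) == f({x!r}) == {y}")
        break
    seen[y] = x
```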

Adding the concept of additional "hash values for simple variations of the file" is just a modification of the function f, creating a new function g which maps X to the new codomain Y x Y x ... x Y (the Cartesian product), where the number of factors is the number of "variations" involved. But if Y contains only numbers of a given length, Y x Y x ... x Y is also a finite integer set with a smaller cardinality than X.
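Purely illustrative arithmetic (the bit counts are arbitrary assumptions): even several fixed-length hash values combined still index far fewer states than the set of all files up to a modest size:

```python
# Even four 256-bit "variation" hashes together index only 2**1024 states,
# while files up to 1 KiB span roughly 2**8192 possibilities.
hash_bits = 256            # |Y| = 2**256 per hash value
num_hashes = 4             # four variation hashes -> |Y|**4 combined states
max_file_bits = 8 * 1024   # files up to 1 KiB

combined_codomain = 2 ** (hash_bits * num_hashes)
domain = 2 ** max_file_bits
print(domain > combined_codomain)   # True: g still cannot be injective
```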

Note also that if you were to loosen the constraint on Y and allow "arbitrary length hash codes", this does not automatically guarantee that a hash function is injective. Adding the idea of "variations" won't change this.

In contrast to this, if f is a lossless compression function (which means it is injective), Y needs to have at least the same cardinality as X. Larger files (= larger numbers) in X will typically be mapped to larger values in Y. A good compression function will map many numbers in X which have a lot of "redundancy" in them (like repeating sequences) to smaller numbers in Y. But the pigeonhole principle dictates that there will also be numbers in X which are necessarily mapped to larger numbers in Y.
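This is easy to observe with a real lossless compressor, for example zlib from the Python standard library: redundant input shrinks, while already-random input typically comes out slightly larger:

```python
import os
import zlib

# Redundant input compresses well; random input has no redundancy to exploit,
# so the format overhead makes the output larger than the input.
redundant = b"abc" * 1000            # 3000 bytes of repetition
random_data = os.urandom(3000)       # 3000 bytes with no redundancy

print(len(zlib.compress(redundant)))     # small (tens of bytes)
print(len(zlib.compress(random_data)))   # typically a bit more than 3000 bytes
```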

Note that there are indeed injective, reversible functions f which are used neither for compression nor for hashing (to a fixed-length code), but for mapping a domain X of integer values to some codomain Y of the same size. These functions obfuscate the original input in such a way that one can only reconstruct the original value by brute-force trial and error. This is the idea behind cryptographic functions.
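A toy sketch of that idea (the affine map and the domain size are assumptions for illustration; real cryptographic functions work on vastly larger domains and resist shortcuts far better): the function is a bijection, yet here we pretend the only way back is exhaustive search:

```python
# f is a bijection on a small integer domain, so it is reversible in
# principle, but the receiver below recovers the input only by trying
# every candidate until one maps to the observed output.
DOMAIN_SIZE = 2 ** 16

def f(x: int) -> int:
    return (x * 40503 + 12345) % DOMAIN_SIZE   # 40503 is odd, so this is a bijection

def brute_force_invert(y: int) -> int:
    return next(x for x in range(DOMAIN_SIZE) if f(x) == y)

original = 54321
print(brute_force_invert(f(original)))   # 54321, recovered by exhaustive search
```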

Licensed under: CC-BY-SA with attribution