Question

I'm writing an article about the Census Bureau's population projections through 2060. The data comes as a .csv file that is 3.3 MB uncompressed.

The file contains 539,781 values, each 5-7 digits long, and takes up 3,455,372 characters. When I gzip the file, it comes down to 1,550,063 bytes, or 1.47 MB.

I want to be able to truthfully state that it would fit on a 3.5-inch floppy, max capacity 1.44 MB. This is just a reference point for readers, not instructions for actually doing it.

Is there a way to calculate the theoretical compressed size of a text file based on the character count above? If we actually had a 3.5-inch floppy and a drive for it, would it be possible to get this file on the disk without information loss? Thanks!

Solution

No, it is not possible to estimate the size of a compressed version of a file based purely on its character count. Different strings can be compressed at different levels of efficiency; a string made purely of one character will be much more easily compressed than a string of purely randomly generated characters.
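To make that concrete, here is a minimal sketch that compresses two inputs of identical length, one made of a single repeated character and one made of random bytes. It uses Python's standard `zlib` module (the same DEFLATE algorithm gzip uses); the length `N` and the variable names are only illustrative choices, not anything from the question.

```python
import os
import zlib

N = 100_000  # both inputs have exactly this many bytes (arbitrary illustrative length)

repetitive = b"7" * N          # one character repeated N times
random_bytes = os.urandom(N)   # random bytes, which have essentially no redundancy


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of `data` after zlib (DEFLATE) compression at level 9."""
    return len(zlib.compress(data, 9))


size_repetitive = compressed_size(repetitive)
size_random = compressed_size(random_bytes)

print(f"repetitive: {N} bytes -> {size_repetitive} bytes")
print(f"random:     {N} bytes -> {size_random} bytes")
```

Both inputs have the same character count, yet the first shrinks to a tiny fraction of its size while the second barely shrinks at all (it may even grow by a few bytes of overhead). The character count alone does not determine the compressed size.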

In information theory, there is a concept of Kolmogorov complexity, which is (more or less) the smallest amount of information necessary to reconstruct a string. Not all strings can be compressed into smaller strings, and it is impossible to build a general algorithm to find the Kolmogorov complexity of an arbitrary string. Moreover, it's impossible to prove that you have found the optimal encoding for a string once the string gets sufficiently long.

Hope this helps!

Other tips

If you want to say it fits on a 1.44 MB floppy, then just prove it with a better compressor. Try 7-Zip or xz (depending on your platform). You are close enough that I'm sure that will do the trick. (Did you use gzip -9?)

By the way, I'm not sure about the utility of this, since many people will have no clue what in the world you're talking about when you describe this "floppy disk" thing to them.

As already noted, it is not possible to calculate the theoretical best compression. Just use the best compressors to get an estimate.

Update:

Downloaded it. xz compressed it to 1177180 bytes. So yes, it fits.
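If you'd rather check this from a script, something like the following would do. It is only a sketch: it uses Python's standard `gzip`, `bz2`, and `lzma` modules rather than the command-line tools, and `FLOPPY_BYTES`, `compressed_sizes`, and `fits_on_floppy` are names I made up for illustration. I'm assuming the raw capacity of a "1.44 MB" disk, 2880 sectors of 512 bytes; a formatted disk holds slightly less because the filesystem takes some space. The sample data is synthetic stand-in text, not the Census file.

```python
import bz2
import gzip
import lzma
import random

# Assumption: raw capacity of a "1.44 MB" 3.5-inch disk (2880 sectors * 512 bytes).
# A formatted disk has somewhat less usable space.
FLOPPY_BYTES = 2880 * 512  # 1,474,560


def compressed_sizes(data: bytes) -> dict:
    """Compress `data` with several standard-library codecs and return sizes in bytes."""
    return {
        "gzip (level 9)": len(gzip.compress(data, compresslevel=9)),
        "bzip2 (level 9)": len(bz2.compress(data, compresslevel=9)),
        "xz (default preset)": len(lzma.compress(data)),
    }


def fits_on_floppy(size: int, capacity: int = FLOPPY_BYTES) -> bool:
    """Return True if `size` bytes is within the assumed disk capacity."""
    return size <= capacity


# Synthetic stand-in: comma-separated 5-7 digit numbers (NOT the real data set).
rng = random.Random(0)
sample = ",".join(str(rng.randint(10_000, 9_999_999)) for _ in range(50_000)).encode("ascii")

sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name}: {len(sample)} -> {size} bytes")

# The two sizes reported in this thread:
print(fits_on_floppy(1_550_063))  # gzip result from the question -> False
print(fits_on_floppy(1_177_180))  # xz result above -> True
```

To run it on the real file, read the bytes with `open(path, "rb").read()` and pass them to `compressed_sizes` in place of `sample`.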

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow