I know most compression methods rely on the data repeating in order to be effective. For example, the string "AAAAAaaaQWERTY" could be represented as "5A3aQWERTY" losslessly, or as something like "8aqwerty" lossily (these are just illustrations, not actual working methods). As far as I know, all compression algorithms count on the repetition of *constant* strings of characters.

Here comes the problem with the string "abcdefghijklmnopqrstuvwxyz". Nothing repeats here, but as you can probably see, the information in the string can be represented far more compactly. In regex-like notation it would be "[a-z]", or maybe "for(x=0;x<26;++x){ascii(97+x)}".

Consider also the string "0149162536496481100121" - it can be represented by "for(x=0;x<12;++x){x*x}".

The string "ABEJQZer" can be represented by "for(x=0;x<8;++x){ascii(65+x*x)}".
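To make these concrete, here is a minimal runnable sketch (Python, chosen just for illustration) of the three generator programs; the assertions check that each short program reproduces its string exactly.

```python
def alphabet():
    # "abcdefghijklmnopqrstuvwxyz" <- for(x=0;x<26;++x){ascii(97+x)}
    return "".join(chr(97 + x) for x in range(26))

def squares():
    # "0149162536496481100121" <- for(x=0;x<12;++x){x*x}
    return "".join(str(x * x) for x in range(12))

def letter_squares():
    # "ABEJQZer" <- for(x=0;x<8;++x){ascii(65+x*x)}
    return "".join(chr(65 + x * x) for x in range(8))

assert alphabet() == "abcdefghijklmnopqrstuvwxyz"
assert squares() == "0149162536496481100121"
assert letter_squares() == "ABEJQZer"
```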

The last two were examples of knowing an algorithm which can reproduce the original string. I know that, in general, efficient algorithms take far less space than the data they can produce.

Like in SVG images (which contain only drawing algorithms in the file), where the size can be smaller than the equivalent JPEG.

My question is: is there a form of compression which takes the data and tries to find efficient algorithms that can represent it? Like vectorizing a raster image (as http://vectormagic.com/ does), but working with other kinds of data too.

Consider audio data (it can be compressed lossily) - some audio editors (Audacity, for example) have project files containing information like "generate a 120 Hz constant frequency with 0.8 amplitude from time 0 to time 2 minutes 45.6 seconds" (Audacity stores this info in XML format). This metadata takes very little memory, and when the project is exported to WAV or MP3, the program "renders" the information into actual samples in the exported format.
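As a rough sketch of the rendering direction (Python with NumPy; the function and parameter names here are made up for illustration, not Audacity's actual schema), turning such a description into samples is just evaluating a formula:

```python
import numpy as np

SAMPLE_RATE = 44100  # assumed sample rate for this sketch

def render_tone(freq_hz, amplitude, duration_s):
    # Expand a tiny parametric description into raw audio samples.
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    return amplitude * np.sin(2 * np.pi * freq_hz * t)

# ~40 bytes of description expand into 88200 samples:
samples = render_tone(freq_hz=120.0, amplitude=0.8, duration_s=2.0)
print(len(samples))  # 88200
```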

In that case the compressor should reverse the rendering process. It should take a WAV or MP3 file, figure out which algorithms can represent the samples (if it's lossy, the algorithms only need to produce some approximation of the samples - like vectormagic.com approximates the image) and produce the compressed file.
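Here is a minimal sketch of that reverse direction, under the big simplifying assumption that the input is a single steady tone: an FFT peak recovers the frequency and amplitude parameters from the samples.

```python
import numpy as np

SAMPLE_RATE = 44100

def analyze_tone(samples):
    # Find the dominant frequency bin and convert its magnitude
    # back to a sine amplitude (|X[k]| ~= amplitude * N / 2).
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / SAMPLE_RATE)
    peak = int(np.argmax(np.abs(spectrum)))
    return freqs[peak], 2.0 * np.abs(spectrum[peak]) / len(samples)

# Round trip: 88200 samples collapse back into two parameters.
t = np.arange(2 * SAMPLE_RATE) / SAMPLE_RATE
freq, amp = analyze_tone(0.8 * np.sin(2 * np.pi * 120.0 * t))
print(freq, amp)  # ~120.0, ~0.8
```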

I understand that compression time would be unbelievably long, but do such (or similar) compression algorithms exist?


Solution

All compression methods are like that: the output is a set of parameters for a fixed algorithm that renders the input, or something similar to the input.

For example, the MP3 audio codec breaks the input into blocks of 576 samples, converts each block into frequency-amplitude space, and prunes what cannot be heard by a human being. The output is equivalent to "during the next 13 milliseconds play frequencies x, y, z with amplitudes a, b, c". This works well for audio data, and the similar approach used in JPEG works well for photographic images.
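A toy illustration of that idea (Python with NumPy; real codecs add psychoacoustic models, quantization and entropy coding on top): transform a block into frequency space, keep only the strongest components as the "parameters", and resynthesize an approximation from them.

```python
import numpy as np

def compress_block(block, keep=8):
    # The "compressed" output: the K strongest (bin, amplitude) pairs.
    spectrum = np.fft.rfft(block)
    strongest = np.argsort(np.abs(spectrum))[-keep:]
    return [(int(k), spectrum[k]) for k in strongest], len(block)

def decompress_block(params, n):
    # Resynthesize an approximation from the kept parameters only.
    spectrum = np.zeros(n // 2 + 1, dtype=complex)
    for k, v in params:
        spectrum[k] = v
    return np.fft.irfft(spectrum, n)

n = 576                                   # one MP3-sized block
t = np.arange(n)
block = 0.8 * np.sin(2 * np.pi * 5 * t / n) + 0.3 * np.sin(2 * np.pi * 20 * t / n)
params, n = compress_block(block)
print(np.max(np.abs(block - decompress_block(params, n))))  # ~0 here
```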

Similar methods can be applied to the cases you mention. Sequences such as 987654 or 010409162536, for example, are generated by successive values of polynomials, and can be represented as the coefficients of those polynomials: the first one as (9, -1) for 9-x, and the second one as (1, 2, 1) for 1+2x+x².
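Recovering those coefficients from the sequences is an ordinary polynomial fit; a quick sketch in Python with NumPy:

```python
import numpy as np

x = np.arange(6)
seq1 = [9, 8, 7, 6, 5, 4]        # "987654"
seq2 = [1, 4, 9, 16, 25, 36]     # "010409162536" read in two-digit groups

# np.polyfit returns coefficients with the highest power first.
print(np.polyfit(x, seq1, 1))    # [-1.  9.]     -> 9 - x
print(np.polyfit(x, seq2, 2))    # [ 1.  2.  1.] -> 1 + 2x + x^2
```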

The choice of algorithm(s) used to generate the input tends to be fixed for simplicity, and tailored to the use case. For example, if you are processing photographic images taken with a digital camera, there is little point in even attempting to produce vector output.

Other tips

When trying to losslessly compress some data, you always start by creating a model. For example, when compressing text in a human language, you assume that there are actually not that many distinct words, and that they repeat over and over. But then many algorithms try to learn the parameters of the model on the go: the algorithm doesn't rely on what those words actually are; it tries to find them in the given input. So the algorithm doesn't depend on the particular language used, but it does depend on the fact that it is a human language, which follows certain patterns.
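A minimal sketch of such an adaptive model (Python; the +1 smoothing is a simplistic stand-in for what real coders use): the model starts out knowing nothing, updates its word counts as it reads, and repeated words become progressively cheaper to encode.

```python
from collections import Counter
import math

def adaptive_cost_bits(words):
    # Ideal code length under a model whose counts are learned on the go.
    counts, total, bits = Counter(), 0, 0.0
    for w in words:
        p = (counts[w] + 1) / (total + len(counts) + 1)  # crude +1 smoothing
        bits += -math.log2(p)   # ideal code length for this word right now
        counts[w] += 1
        total += 1
    return bits

text = "the cat sat on the mat and the cat ran".split()
print(adaptive_cost_bits(text))  # repeats get cheaper as the model learns
```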

In general, there is no perfect algorithm which can compress anything losslessly; this is mathematically proven by a simple counting argument. For any algorithm there exists some input for which the compression result is bigger than the input itself.
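The counting (pigeonhole) argument is easy to verify: there are 2^n bit strings of length n, but only 2^n - 1 strings strictly shorter, so no lossless (injective) compressor can shrink them all.

```python
# 2**n inputs of length n, but only 2**n - 1 strictly shorter outputs:
n = 16
print(2 ** n, sum(2 ** k for k in range(n)))  # 65536 vs 65535
```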

You can also try data de-duplication: http://en.m.wikipedia.org/wiki/Data_deduplication. It's a slightly different and more intelligent kind of data compression.
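A minimal sketch of the idea (Python; real systems typically use content-defined chunk boundaries rather than fixed offsets): store each distinct chunk once and represent the file as a list of chunk references.

```python
import hashlib

def dedup(data, chunk_size=4096):
    # Store each distinct chunk once; the "file" is a list of chunk ids.
    store, recipe = {}, []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe

data = b"A" * 4096 * 10 + b"B" * 4096    # ten identical chunks plus one
store, recipe = dedup(data)
print(len(recipe), len(store))           # 11 chunks referenced, 2 stored
```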
