Question

According to Wikipedia:

Informally, from the point of view of algorithmic information theory, the information content of a string is equivalent to the length of the shortest possible self-contained representation of that string.

What is the analogous informal yet rigorous definition of "useful information"? Why is "useful information" not taken as the more natural or more fundamental concept? Naively, it seems a purely random string must by definition contain zero information, so I'm trying to get my head around the fact that it is considered to have maximal information by the standard definition.


Solution

The central concept here is Kolmogorov complexity, and more specifically compressibility. To get an intuitive feeling for compressibility, consider two strings $A \in \mathbb{B}^*$ and $B \in \mathbb{B}^*$, where $\mathbb{B} = \{ 0,1 \}$. Let

$A = 1010$ $1010$ $1010$ $1010$, and

$B = 1011$ $0110$ $0111$ $1001$.

Note that $|A| = |B| = 16$. How could we quantify how much information $A$ or $B$ carries? Classical information theory tells us that transmitting a string of length $n$ takes $n$ bits on average, but it cannot tell us how many bits we need to transmit one specific string of length $n$.

Why is the information content of a random string not zero?

On a closer look, we can see that in fact $A = (10)^8$, the block $10$ repeated eight times. It is much harder to say whether $B$ has any obvious pattern in its structure; at least it looks and feels more random than $A$. Because we can find a pattern in $A$, we can easily compress it and represent it with fewer than $16$ bits. Conversely, since it is not easy to detect any pattern in $B$, we cannot compress it as much. Therefore we can say that $B$ has more information than $A$. Moreover, a random string of length $n$ has maximal information, since there is no way to compress it and hence represent it with fewer than $n$ bits.
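Kolmogorov complexity itself is uncomputable, but any real compressor gives an upper bound on it: whatever zlib can squeeze out, a shortest program can too. Here is a minimal Python sketch of the structured-versus-random contrast; the 1024-character length and the choice of zlib are arbitrary illustrative choices, not part of the definition.

```python
import random
import zlib

# Kolmogorov complexity is uncomputable, but a compressor upper-bounds it:
# whatever zlib can squeeze out, a shortest program can too.
structured = "10" * 512                                         # A's pattern, length 1024
random_str = "".join(random.choice("01") for _ in range(1024))  # random bits, length 1024

for name, s in [("structured", structured), ("random", random_str)]:
    n = len(zlib.compress(s.encode(), level=9))
    print(f"{name}: {len(s)} chars -> {n} bytes compressed")

# Typical result: the structured string collapses to roughly 20 bytes, while
# the random one stays close to 1 bit per character (well over 128 bytes),
# which is exactly its information content.
```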

What is useful information, then?

For useful information, yes, there is a definition using a Turing machine $T$. The useful information in $x \in \mathbb{B}^*$ is

$$\min \{\, l(T) + C(x \mid T) : T \in \{ T_0, T_1, \ldots \} \,\},$$

where $l(T)$ denotes the length of a self-delimiting encoding of a Turing machine $T$. The notation is the usual one: $C(x)$ denotes the Kolmogorov complexity of $x$, and $C(x \mid y)$ the conditional Kolmogorov complexity of $x$ given $y$.

Here $T$ embodies the amount of useful information contained in $x$. What we could ask is which such $T$ to select among those that attain the minimum. The problem is to separate a shortest program $x^*$ into two parts $x^* = pq$ such that $p$ represents an appropriate $T$. This is in fact the very idea that spawned minimum description length (MDL).
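The two-part minimization is uncomputable in general, but its shape can be illustrated with a toy model class. In the Python sketch below, the "machines" are just periodic-pattern generators, and the bit accounting is a crude invented stand-in for $l(T) + C(x \mid T)$; none of this is the real definition, only its flavor.

```python
from math import ceil, log2

def two_part_cost(x: str) -> tuple[str, int]:
    """Toy two-part code: cost = l(model) + l(x | model), in bits.

    The model class (periodic patterns vs. 'no structure') and the
    bit bookkeeping are illustrative stand-ins for l(T) + C(x|T).
    """
    n = len(x)
    candidates = []
    # Model family 1: "repeat this pattern of period p".
    for p in range(1, n // 2 + 1):
        if x == (x[:p] * (n // p + 1))[:n]:
            model_bits = p + 2 * ceil(log2(max(p, 2)))  # pattern plus its length, crudely encoded
            data_bits = ceil(log2(n))                   # given the model, only n remains to specify
            candidates.append((f"period-{p}", model_bits + data_bits))
            break                                       # the smallest period is the cheapest here
    # Model family 2: "no structure" -- the model is empty, the data is x verbatim.
    candidates.append(("literal", n))
    return min(candidates, key=lambda c: c[1])

print(two_part_cost("1010101010101010"))   # A: ('period-2', 8) -- the periodic model wins
print(two_part_cost("1011011001111001"))   # B: ('literal', 16) -- no short model exists
```

The split mirrors $x^* = pq$: the winning model plays the role of $p$, the useful structural part, and the residual data plays the role of $q$.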

Other tips

It could be because "useful" is hard to define. Say we have a highly-structured, information-rich message $x$ which can be compressed at most by a factor of $\alpha$ to the message $y$. Intuitively, $x$ and $y$ contain the same amount of useful information; indeed, they contain the same amount of information according to the usual definition. Now imagine a prefix $z$ of $x$ of the same length as $y$; it should contain no more useful information than $x$, hence, no more than $y$. However, $y$ is more "random" than $z$, since $z$ can be compressed and $y$ can't. So if we try to associate "useful" information with compressibility, we could run into the following paradox: a prefix of a message could have higher "useful" information than the entire message, seemingly a contradiction.
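This paradox is easy to reproduce with an ordinary compressor standing in for Kolmogorov complexity (a rough analogy, since real compressors are far weaker; the repeated sentence below is an arbitrary example of a highly structured $x$):

```python
import zlib

x = ("the quick brown fox jumps over the lazy dog. " * 200).encode()  # structured, compressible
y = zlib.compress(x, level=9)   # same content as x, but near-incompressible
z = x[:len(y)]                  # a prefix of x with the same length as y

print(f"|x|={len(x)}, |y|=|z|={len(y)}")
print("compress(y):", len(zlib.compress(y, level=9)))  # hardly shrinks: y already looks random
print("compress(z):", len(zlib.compress(z, level=9)))  # shrinks noticeably: z is still structured
```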

From a less formal point of view, I think it may help if you detach yourself from the word "random," as you're correct that truly random bits don't store any information in a practical sense. (If I encrypt a set of names and send the encrypted values to you, they may have very high Kolmogorov complexity, but that won't help you figure out the names.)

But think about it this way. If you see a website in a foreign language (say Swedish, assuming you don't speak it), it's going to look more or less random. There will be some order to the words, but not much. However, if you look at a webpage whose text reads 123456123456123456123456... and so on, you'll be able to understand it more quickly. Not speaking Swedish, you'll probably get much more out of the numeric page, even if the Swedish webpage said the equivalent of "the first six numbers repeated sequentially." The websites contain the same information, but one looks random to you. And in terms of space, the page you understand is far less efficient than the Swedish one, even though it stores the same information. You may not find this information "useful" because it's in Swedish, but the information is still there.

The notion of "information" is meant to be universal, so what looks like random (and therefore useless) bits to you may store a great deal of information for someone else. The measure of information is intended to be an intrinsic property of the string; it cannot depend on what does and doesn't make sense to you, or on what you can and can't interpret.

Another (more technical) point that may help is that I'm being slightly disingenuous here. As Juho points out, information is defined relative to whoever is interpreting it. You may find the Swedish webpage completely useless as a vehicle for information, but someone who speaks Swedish may find it carries a great deal. The definition does reflect this. However, the mathematics tells us that the length of the shortest webpage that communicates this content to you and the length of the shortest webpage that communicates it to a Swedish speaker differ only by an additive constant. Why? Because for you, as a non-Swedish speaker, the shortest way to store the page that you can understand is "the first six integers repeated sequentially." This may be quite a bit longer than the Swedish version. (Bear with me here and assume that Swedish is super short and efficient whereas English is very long and wasteful.)

But even if you were able to speak Swedish, you'd only be able to cut an additive constant from the length! Why? Because you could always go buy a Swedish-English dictionary. Then the super-short Swedish webpages would make sense to you. Sure, they only make sense when you have the dictionary, but the dictionary has constant length. So

$$(\text{most efficient representation in English}) \leq (\text{most efficient representation in Swedish}) + (\text{length of a Swedish-English dictionary}).$$

This is getting a bit off-topic from your original question, but the point I'm trying to make is that it doesn't matter too much who is reading the information. The random-looking Swedish webpage was not "useful" to you, but it's "useful" to someone else, and you're only a constant amount of information away from being able to make use of it yourself.
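The dictionary argument is the informal version of the invariance theorem of Kolmogorov complexity. In standard notation (with the two universal machines $U$ and $V$ playing the roles of the two readers):

$$C_U(x) \;\leq\; C_V(x) + c_{U,V} \qquad \text{for all } x \in \mathbb{B}^*,$$

where $C_U$ and $C_V$ denote Kolmogorov complexity relative to $U$ and $V$, and $c_{U,V}$ is a constant independent of $x$: the length of a program for $U$ that simulates $V$, i.e., the "dictionary."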

Licensed under: CC-BY-SA with attribution