The General Information Metric Hypothesis

Does there exist an information metric with truly general utility? If so, a scientist could use it to choose which experiment to do: the best experiment is that one that yields the largest amount of information about the scientist’s question of interest (or, over the long-term, the highest information rate per unit time / expense). Indeed, if the metric were truly general, the scientist could use it to decide which research question is “most interesting” (again, compute the expected information yield for the different research directions). Actually, if such an information metric existed, the “scientist” could just be a robot, because all that is required is the ability to calculate this metric for different possible experiments (observations). This wouldn’t be artificial intelligence in the traditional sense of that field, but instead just a big statistical number-crunching computation. In a way, scientific computing at its dullest.
There are lots of possible objections to this idea, which mainly come down to competing definitions of “information”. I’ll just try to outline these.

  • complexity vs. information: standard information theory metrics like Shannon entropy or Kolmogorov complexity are measures of complexity, which is clearly related to but not the same as “information”. The usual interpretation placed on this is that information is a reduction in complexity. The substitution paradox illustrates this question. Say we take the text of the Gettysberg Address and gradually randomize it (by swapping random pairs of characters in the text). Intuitively this operation reduces the information content. It does indeed increase the text’s complexity (as measured by entropy or Kolmogorov, indicating increasing randomness), consistent with the proposed inverse relationship between complexity and information. On the other hand, if we instead gradually replace individual characters of the text with a fixed character (e.g. zeros), this reduces both the complexity metrics and intuitive information content. This appears to disprove the “reduction in complexity” definition of information. Does this leave us with no usable definition of an information metric?
  • observable vs. hidden metrics of information: the Shannon and Kolmogorov metrics treat complexity as an observable property. That is, given a set of observations obs, the complexity is a uniquely determined value, just a function K(obs). Another way of saying this is that the complexity metrics are model-free; that is, they have no model component that can give different values of the metric under different alternative models. On the other hand, there are many reasons to think that information is a hidden parameter, i.e. the information content of a text is not the letters themselves (the observables), but its meaning. For example, this trivially resolves the substitution paradox: for a text T, modified text T’ and meaning `\mu`, `Pr(M|T’)` forms a Markov chain `T’ \rightarrow T \rightarrow \mu`, and the mutual information `I(T’;\mu)\rightarrow 0` for either shuffling or erasure of `T \rightarrow T’`. But most importantly, this point of view brings to bear the full machinery of statistical inference (i.e. Bayes Law and ways to compute it) on the problem of defining an information metric. In my view that is a Good Thing. But it does open the Pandora’s box of models, i.e. Bayes Law in principle means a summation over an infinite series of all possible models, and “reading the information content of a set of observations” turns into “discovering the model that best fits the available observations”. That (or rather the inference-information theory fusion part of it) will be the main topic of this blog. For example, a key problem is enumeration: can you obtain a generator function that enumerates models (terms of the infinite series) in descending order of their prior probabilities?
  • finite bounds: one simple but crucial test for a proposed information metric is that it must remain bounded (finite) for any number of observations N from a stationary distribution. i.e. `I(obs^N)<C` for `N \rightarrow \infty`, where `C` is some constant. For example, if we point the first gamma ray detector ever invented at the sun, and record for 10 days, we expect to learn something, but if we record its output for N times as long we don’t expect N times as much information content (assuming the sun’s emissions are a stationary distribution). As N grows, eventually the information content should saturate (i.e. we’re just seeing “the same old thing” over and over). This simple consideration is actually useful for rejecting some plausible information metrics. For example, say a set of observations are emitted from a normal distribution. We might propose to use the decrease in entropy of the posterior distributions for the sufficient statistics `Pr(\mu,\theta|obs)` as the information metric. As we collect initial observations, this entropy goes down, yielding information. The problem is that it keeps going down without bound (e.g. the variance for the posterior distribution of `\mu` goes down as `\frac{1}{N}`, and the posterior entropy heads towards `-\infty`). That would make the information metric increase without bound. One can try to finesse this by emphasizing that no detector has infinite resolution (so we have to separate systematic error from random (sampling) error), but still it indicates that there is a basic problem with this direction. On the other hand, if we define information as measuring our ability to predict observables, this problem goes away.
  • the No Free Lunch theorem: this theorem states that no optimization algorithm can perform better on average over all possible cost functions, than a completely random algorithm. This is sometimes interpreted as a rigorous demonstration that there is no such thing as a general method of problem solving, since over the set of all possible optimization problems (cost functions), NO such method could ever do better than random. (Indeed, its authors present it that way). However, it’s a long way from the very specific assumptions and results of the NFL theorem to the rather different conditions of real-world statistical inference. Does the NFL theorem actually apply to statistical inference problems and rule out the possibility of a general information metric? Personally, I don’t think so, as I’ll discuss in detail below.

Leave a Reply