Organization of Biological Networks Workshop Talk

March 9th, 2008

This week-long workshop was a highlight of my time at IMA. I really enjoyed the diversity and excitement of this new field. Here’s a video of my talk on HIV drug resistance evolution.

Seminar 4 at IMA

February 29th, 2008

Yesterday I talked about applications of potential information to experiment planning, using the example of a robot seeking to discover the principles of genetics from the initial observation of a “mutant” pea plant with white flowers. You can listen to the audio (right click on the audio link, and Save Link As, then listen to the downloaded file using QuickTime player or Real player). I also captured most of the material I wrote on the whiteboard.  Some relevant background material (and detailed exposition of the RoboMendel example) is also available.

Notes for Third IMA Seminar

February 22nd, 2008

Well, I failed to record my seminar audio, but here are some relevant notes for material discussed in the third seminar. This time we discussed the application of information metrics to experiment planning, rather than just model selection. One metric that I emphasized this time is the notion of potential information, which provides a signal for whether the current model needs to be expanded because its fit to the observations is inadequate. The attached material discusses some concrete examples of potential information, for example, for experiment planning.

HIV Drug Resistance Evolution Talk at IMA

February 20th, 2008

I gave a talk today on my lab’s work on HIV drug resistance evolution and conditional selection pressure “networks”. The slides and audio are available here (to download the audio right-click the link and select Save Link As…). To listen to the audio you can use the QuickTime player or Real player. For our publications on this topic, see our publications page.

Empirical Information as a metric for Statistical Inference

February 15th, 2008

Here are my slides for my second talk at IMA on Feb. 14. I tried to introduce some problems with typical information metrics as they apply to statistical inference problems. Then I describe empirical information, my preferred information metric for statistical inference. The slides are available as a PDF, and the audio of the talk is also available — you can use either RealPlayer or the Quicktime player to listen to this. To download the audio, right-click on this link and choose Save Link As…

I’ve also posted some background material cut from different chapters of my draft textbook as a PDF.

Chapter 1 on probabilistic inference

February 7th, 2008

Here are a couple of items relevant to my Feb. 7 intro session at IMA:

Pygr.Data in A Web Browser!

November 26th, 2007

I’ve been hacking a bit with Silverlight, Microsoft’s environment for running dynamic languages like Python and Ruby directly within a web browser like Firefox or IE. It seems to work quite well, and it’s easy to get Python code up and running in Silverlight. Indeed, it’s been surprisingly easy to get quite significant chunks of Python running in Silverlight — specifically, pygr, including large portions of code originally written for Pyrex (a mixed C / Python language environment). First I’ll describe my impressions of Silverlight and its implications for Python, then what I’ve accomplished with it.

I think Silverlight potentially has a lot of importance for Python for several reasons. Its basic function is fairly amazing: it lets you run Python inside your web browser with complete control over the web page contents and interactivity. Those of us who feel that the web browser is the logical place for many kinds of interactive documents, visualization and rapid prototyping, but want to use a real language (instead of Javascript!), finally have a solution. This is exciting! It is thrilling to be creating interactive web pages whose interactivity is all done by Python code — forget all that JavaScript!

But Pythonistas should also pay attention to Silverlight as a potential threat: this is Microsoft’s attempt to hijack Python’s many virtues and put them to work for Microsoft. For example, Silverlight runs IronPython, and as such lacks most of the Python standard libraries that Pythonistas know and rely on. In their place it gives you access to .NET. Similarly, it redirects you away from open standards like SVG and into Microsoft’s WPF (Windows Presentation Framework?). The concern here is that huge numbers of .Net and web developers will be taught a Python that is Not Python, i.e. they will be taught to write code that can’t run in standard Python, or indeed anywhere outside of Microsoft’s IronPython. The good news, I think, is that Python is too dynamic to box in this way; it is very easy to port Python standard libraries to run in Silverlight. I think Pythonistas should push very hard to provide as complete as possible a standard library for IronPython and Silverlight, so that developers in those environments are not forced over to the Dark Side.

Another interesting possibility is that Silverlight is the “restricted execution environment” Python has always lacked. Python code running in Silverlight supposedly has no access to the file system or system calls, and no way to read or alter user files. Since it runs in the web browser, and executes code sent by potentially malicious websites, logically it would have to be secure in this sense, or it will be worse than useless for Microsoft’s purposes. Taking a cynical view, if Silverlight takes off, hackers are going to be attacking it from every angle, and Microsoft will be forced to patch up the holes. So over time, Silverlight could evolve into a safe “sandbox” where you can run code that you want to confine to very limited privileges.

OK, so what have I done with Silverlight?

  • I ported a fair number of missing standard library modules to get my own code working. The most significant are pickle, types, xmllib, xmlrpclib, and urllib (only some functions; Silverlight already provides http connection objects). This was amazingly easy.
  • I mainly did this to create an XML-RPC connection back to the server. Silverlight 1.1 alpha refresh is restricted to only access the same server that the code was retrieved from. I’ve written a combined HTTP/XML-RPC server (in Python, naturally, it’s dead easy) so that my Python code in Silverlight can communicate with the server via XML-RPC. This is incredibly easy, just as you’re used to doing in Python:
    import xmlrpclib
    server = xmlrpclib.ServerProxy('http://localhost:8000')
    result = server.some_method(*args)

    I will make all this available so Pythonistas can start playing with this by quickly launching the server and simply pointing their web browser at it.
  • I’ve got almost all of Pygr, including Pygr.Data, running in Silverlight. The one module that required some porting was cnestedlist, which was written in a combination of C and Python. I had to convert some things to pure Python and get rid of other things that aren’t relevant in Silverlight (e.g. file IO).
  • I have pygr.Data working great in Silverlight. That is, the server is running an XML-RPC pygr.Data server, and in Silverlight I can just import pygr.Data as usual and immediately do everything I would usually do i.e. get both sequence databases / sequences and alignment databases (NLMSA) and work with them exactly as if I were in my usual Python interpreter. This is a little mind blowing. I can’t wait to demo this with access to a 28 vertebrate genome alignment in Firefox!
  • I’ve been learning all this in part via Michael Foord’s very helpful Silverlight / IronPython pages. I used his simple interactive console code to try all this out by hand. But I hit a lot of crashes that killed Firefox. At first I just thought Silverlight 1.1 alpha is really buggy, but finally I realized that the problem is not Silverlight itself, but apparently the interactive console. When I just run my code directly in Silverlight (not through his interactive console) the crashing goes away. I wrote my own much simpler interactive console to eliminate all that crashing.
  • Things I haven’t figured out: proper traceback printing; Silverlight’s odd handling of hierarchical module names. It doesn’t seem to be able to handle them, so to start with I just put the pygr modules in a single directory and access them as import seqdb instead of from pygr import seqdb. I had to kluge pickle.unpickling to follow this convention too.

I will post a tar of the source code tomorrow, and update this page to link to it. But now to bed.

Example: Modeling the Discovery of Mendelian Genetics

March 2nd, 2007

Let’s examine the principles for a general process of learning, via a scientific example: could we program a robot to make scientific discoveries directly from raw observations? As an example problem, let’s take the discovery of the basic principles of genetics by Gregor Mendel and subsequent researchers.

Preliminaries: Modeling Species

Before we can even think about asking our robot to discover genetics, we must consider how it would become “aware'’ of the basics of biology through its standard modeling process. This process has several steps.

  • simple clustering: a good place to begin is the problem of detecting distinct clusters in raw observational data. Imagine that our robot is equipped with a variety of detectors that allow it to measure different characteristics like weight, height (and other dimensions), color, etc. for every “object'’ it encounters. Assuming an environment that is relatively sparse in different types of objects, these datawill form distinct peaks in the multidimensional parameter space. We can treat this as a completely conventional clustering problem. We can simply outline a procedure:

1) formulate an uninformative root model (e.g. a constant density over the whole region in which observations are possible). We can think of this as our “zero information'’ model.

2) identify the region of observation space with maximum potential information relative to our current model (initially, just the root model). This simply requires performing some type of smoothing function over the observation space, and computing relative entropy density versus our current model. Distinct peaks will be identified by regions of positive relative entropy density, surrounded by regions of negative relative entropy density. Thus peaks can be defined as regions above some relative entropy density threshold (near zero). We can identify the peak with maximum integrated potential information, and construct a model (e.g. a normal distribution) for the observations in that region. We set the prior probability `\lambda_j` of this model to maximize the total probability of the observations.

`\Pr(X_1 X_2…X_n,f_1,f_2…f_m)=(\prod_{i=1}^n{\sum_{j=0}^m{\lambda_j \Pr(X_i|f_j)}})(\prod_{j=0}^m{\Pr(f_j|f_0)})`

where `f_j` represents the `j`th model (`f_0` is the root model), and $\lambda_j$ represents the inferred weights of the models ($\sum_{j=0}^m{\lambda_j}=1$). Note that this expression treats the models as being emitted from the root model just as the observations are emitted from the models, i.e. $f_0 \rightarrow f_j \rightarrow X_i$. We will examine such multilevel modeling approaches in detail later. Finally, we accept the new model if the total probability increase (odds ratio for including the new model vs. omitting it) is above some chosen confidence threshold.

3) Simply repeat step 2, using the new density combining the previous model and the new model. Stop when we can no longer find any confident new models.

  • Detection of Species: The clusters obtained via this procedure will correspond to distinct groups of objects. These might include some living organisms (e.g. fruitflies, tigers, pea plants) and some “dead'’ matter (e.g. grains of sand, pieces of driftwood, tin cans). We can think of these as species, each defined entirely by their population distributions.
  • Detection of Individuals: What if we look repeatedly at a small region containing not very many objects? This procedure of “sampling with replacement'’ will resample the same item multiple times. Of course, resampling the same item will not yield identical observation values (e.g. a tiger’s weight might vary depending on whether he just ate; also the robot’s detectors themselves have variation due to noise). Thus each individual will itself be represented by a peak in the observation space, most likely with much narrower variance than that of the entire species population.

This poses a clear problem for a conventional clustering procedure. Which level of peaks should it detect? If it clusters the observations into individuals, then the species clustering is lost (the species is split into individuals), and vice versa. Clearly the right answer is to recognize the existence of both levels of clustering: i.e. that the raw observations are emitted by individuals, who are in turn emitted by the distribution representing their species. This is just another layer in our multilevel model: i.e. $f_0 \rightarrow f_j \rightarrow g_k \rightarrow X_i$ where $f_0$ is the root model, $f_j$ is a species model, $g_k$ is an individual model, and $X_i$ is a raw observation.

We lose our ability to detect individuals when their peaks overlap (e.g. if many similar individuals are observed at once). If our robot is stationed in the arctic and sees a similar looking wolf every day, how can it tell if it’s always the same individual, or whether it’s many different wolves (maybe all wolves look identical)? This forces us to collect a new type of observation, tracking individuals, e.g. by marking a given individual with a unique tag, or by following that one wolf continuously through time so we know we’re observing the same individual.

  • The Model-Emits-Model Pattern: Both species and individual are models, inferred from observational data. However, we could argue that individuals act as observations for inferring species. In fact, this is how we actually study populations in most cases: we collect observations about individuals (often collecting multiple observations of one individual to get a good mean value), and then use the sample of individuals to infer the distribution of the species. It would be incorrect to use the raw observations directly to infer the species distribution; for example, we should not include multiple observations of the same individual (they can be treated as independent observations of the individual but not of the species). Indeed, if we think carefully about most of the parameters we consider to be “raw observations'’ (e.g. “sequence of the PCR product'’; “weight of the tiger'’), these are actually hidden variables inferred from even “rawer'’ detector outputs (e.g. chromatogram traces; movement of a pointer on a scale dial). Thus our “observations'’ are themselves models. From the point of view of multilevel modeling this is perfectly natural: everything is assumed to be emitted by a higher-level model, going back recursively to the root model.
  • An obvious question from this point of view is how do individuals emit individuals? Here our robot can make a very interesting discovery: biological reproduction. Individuals give birth to new individuals, always of the same species. This introduces a new meaning for “emits'’ (our robot can actually watch the new individual emerging from the first individual): $\forall g_1,g_2:g_1 \rightarrow g_2 \implies \exists f: f \rightarrow g_1,f \rightarrow g_2$. The species-emits-individual relation can be thought of as “individual $g$ is an instance of species $f$'’ (membership of a distribution). By contrast, the individual-emits-individual relation is a “causal theory'’ of how individuals are actually created. If our robot is alert, it may notice that actually two individuals, male and female, are needed to produce a “child'’ individual: $g_1 \times g_2 \rightarrow g_3$.
  • The Bio-object Model: This pattern itself becomes a model, in the following way. Because multiple species (fruitflies, tigers, pea plants etc.) all display this same pattern, while other “species'’ (sand, rocks, tin cans etc.) do not obey this model, our robot will become convinced that this model is a distinct cluster. This model gets added to its repertoire of models that can be used to emit new species instances. In other words, this model becomes a class, of which many species are member instances (analogous to “individual is a member of species'’). We can refer to this model as the “Bio-object'’ model, since it captures precisely what defines living organisms as distinct from other types of “objects'’.

How might we formulate this model? Focusing for the moment on asexual reproduction (i.e. an instance of species $x$ generates a new instance of species $x$), we can express this model as a trivial graph $x \rightarrow x$, i.e. a single node $x$, with a single outgoing edge pointing to itself. We can imagine this model emitting many “species models'’ via the following likelihood function:

`p(obs|model)=\begin{array}{l c}
1 & \text{if {\em obs} has self-edge} \\
& \\
0 & \text{if {\em obs} has non-self edge}
\end{array}`

Under this model, the probability of $wolf \rightarrow wolf$ is one, but the probability of $wolf \rightarrow mouse$ is zero. Thus this “bio-object model'’ emits many individual species models: e.g. $mouse \rightarrow mouse$, $pea \rightarrow pea$ etc.). The bio-object model becomes a new layer in the multi-level modeling: $f_0 \rightarrow b \rightarrow f_j \rightarrow g_k \rightarrow X_i$ where $f_0$ is the root model, $b$ is the bio-object model (consisting of the likelihood function given above), $f_j$ is a species model, $g_k$ is an individual model, and $X_i$ is a raw observation. Of course, there will also be a “dead-matter'’ model, which emits types of objects that do not follow the bio-object rule (e.g. sand, automobiles, etc.).

The Mutant Observation

In this context, let’s consider Mendel’s observation that some pea plants give white flowers (instead of the usual purple flowers; I’ll abbreviate these as Wh and Pu). Our current models provide two possible models of this observation:

  1. Wh is just part of random variation within the pea species: One possible model is to propose that Wh is part of the pea species, i.e. that the Wh observations were emitted from the standard “pea species model'’ Pu. There are two problems here:
  • Wh and Pu are distinct peaks: If the Wh observations simply represented the tail of the Pu distribution, they could be considered to be part of the Pu distribution. (If the Wh observations fit within the density of the Pu tail, they would have zero potential information). However, the Wh observations form a separate peak that is many standard deviations distant from the Pu peak. They do not appear to be explainable as part of the Pu peak. There are two distinct peaks here, implying two different models.
  • Wh is not randomly distributed: Even if we wanted to argue that Wh is emitted from the Pu peak as part of its random variation, we run into an immediate problem. Wh does not behave like random variation but instead occurs in a highly non-random pattern. That is, instead of each flower having a certain probability of randomly being white or purple, we observe non-random clustering by individual: all the flowers on a given individual plant are white, or all of them are purple. This strong segregation is more what we would expect if Wh and Pu were separate species.

2. “White'’ is a new species: If the white-flowered plants lie outside the multivariate distribution (height, weight, number of flowers, color, etc.) of the pea species, one obvious interpretation within our existing model is that Wh is a new species. Of course, it would then be expected to obey the “species model'’, i.e. $x \rightarrow x$, only giving rise to Wh progeny. This predicts strong expectation information for tracking the progeny of Wh plants, and the progeny of Pu plants: we have high probability that this set of observations will enable us to “discover'’ a new species (add a new species model for the Wh species. Initial observations appear to validate this expectation (the progeny of Wh individuals are themselves Wh and the progeny of Pu individuals are themselves Pu). Thus Wh appears to be a new species.

The crossing experiment

The next experiment to validate the Wh species model is to attempt a cross between Wh and Pu: according to the species model, no progeny should result. At this point Mendel made a crucial observation: Wh and Pu individuals can be crossed, and yield purple progeny: $Wh \times Pu \rightarrow Pu$. This comes as a real shock, because it violates our fundamental model for bio-objects (i.e. the species propagation model).

  • Higher-order Potential Information: We already have strong evidence that Wh is not part of the Pu species; now we have strong evidence that it is not a separate species. This apparent contradiction, analogous to a reductio ad absurdam step in a mathematical proof, suggests that our starting assumptions, i.e. the “bio-object model'’, are wrong, or at a minimum clearly incomplete. All of this is detectable as potential information, but this time with a difference: instead of indicating extension of the “surface'’ models that directly emit observations (e.g. add a new species model), this potential information indicates that we need to extend our higher-order models. Our “bio-object model'’ is apparently no longer adequate to explain the behavior of bio-objects such as pea plants.
  • Can the species model be rescued? Actually, there is one way that the species model could be rescued. In Mendel’s day it was well-known that in certain cases, similar species could be crossed, for example, $horse \times donkey \rightarrow mule$, but that the progeny were sterile (unable to reproduce). This is observed often enough in different species that it can be added as an extension of the species model: $x \times x^{\prime} \rightarrow y$, where $y$ has no outgoing edges (no progeny). This model can in turn emit specific cases like $horse \times donkey \rightarrow mule$. So an obvious question is whether the $Wh \times Pu \rightarrow Pu$ result can be explained by this model. This creates expectation information for testing to see if the “hybrid'’ progeny are sterile.
  • The hybrid surprise: This leads to an even more shocking result. Not only can the hybrids reproduce as successfully as normal pea plants, they give a mix of $\frac{1}{4}Wh$ and $\frac{3}{4}Pu$ progeny. This strongly violates the species model, in two ways. First, it should not be possible to cross two species and get fertile offspring. Second, $Wh$ and $Pu$ are “turning into each other'’ with alarming frequency, in both directions. In the first cross, progeny of a $Wh$ parent come out $Pu$, and now some of their progeny turn out $Wh$!

From Population Model to Individual Causal Theory

This amazing result has some subtle consequences.

  • The discovery of hidden variables: Evidently at least some of the hybrid plants must be different from $Pu$ plants, because they give such radically different progeny ($\frac{1}{4}Wh$ from the hybrid group, vs. none from the $Pu$ plants). Yet the individual plants of the two groups are themselves indistinguishable– they both have purple flowers. Just classifying the plants on the basis of the flower color is apparently no longer an adequate description. This suggests that we must switch from modeling the data by simply clustering the observations (e.g. “purple'’ vs. “white'̵ ;) to infering a “hidden variable theory'’: hidden variables that can explain the observations.
  • From population models to individual states: The obvious implication is that we can no longer assume that all individuals of a given “population'’ are the same just because they are all purple. An obvious solution is that we can distinguish each individual’s “true type'’ by performing the self-cross progeny test. We have strong expectation information for doing this analysis, since we expect to be able to confidently determine the type of each individual, and we have substantial uncertainty of its true type before we do that experiment. For example, the population average result $hybrid \rightarrow \frac{1}{4}Wh + \frac{3}{4}Pu$ could have very different meanings depending on whether all hybrid individuals give this same result, or whether there might be different types of hybrid individuals that have different progeny patterns. We can easily do this experiment for individual hybrid plants, and there is high expectation information for being able to interpret the results of such experiments. The experiment result is that we find each hybrid individual yields the same proportion of $\frac{1}{4}Wh + \frac{3}{4}Pu$ progeny. This suggests that “hybrid'’ represents a single individual state; we therefore can give it the symbol $Hy$.

A slightly more subtle variant on the same “individual state'’ question is to ask whether the $\frac{3}{4}Pu$ are really all $Pu$; after all, some of them might be $Hy$. We can resolve this by testing each individual to see what progeny it gives. This yields another interesting experimental result: the hybrid cross actually gives a mix of progeny: $Wh \times Pu \rightarrow Hy \rightarrow \frac{1}{4}Wh + \frac{1}{2}Hy + \frac{1}{4}Pu$

  • This process has taken us decisively away from the “random population'’ model, in which all the observations were drawn i.i.d. from a single distribution, to a “hidden variable'’ model, where each individual carries hidden variables that are causal i.e. they determine the outcomes of its matings. For example, the very existence of the $Hy$ category corresponds to a hidden variable; $Hy$ individuals have the same purple flowers as $Pu$ individuals, but are demonstrably different by the self-cross test. Our models have shifted from working at the level of populations (”species'̵ ;) to working at the level of individuals, and from i.i.d. to relying on hidden variables attached to each individual.
  • This immediately gives us strong expectation information to re-do some of the above experiments at the individual (rather than the population) level, i.e. we must track progeny by individuals rather than by populations. This generates large amounts of new information about the hidden states of individuals. We can refer to this kind of extrapolation (using a newly established principle to collect a new kind of information about many individual cases) as “surveying'’.
  • For example, one obvious experiment is to cross $Wh \times Hy$, or $Pu \times Hy$. This fills in the last four cells of the $3 \times 3$ contingency table representing all possible crosses (see table). It’s important to understand that this table can be obtained directly from simple Baum-Welch training from the raw observational data.
Parent $Wh$ $Hy$ $Pu$
$Wh$ $Wh$ $\frac{1}{2}Wh+\frac{1}{2} Hy$ $Hy$
$Hy$ $\frac{1}{2}Wh+\frac{1}{2} Hy$ $\frac{1}{4}Wh+\frac{1}{2}Hy+\frac{1}{4}Pu$ $\frac{1}{2}Hy+\frac{1}{2}Pu$
$Pu$ $Hy$ $\frac{1}{2}Hy+\frac{1}{2}Pu$ $Pu$

Sufficient Models vs. Minimal Sufficient Models

This contingency table, consisting of $9\times 2 = 18$ degrees of freedom, is derived directly from observations via Baum-Welch training. It is itself a model, but the strong regularities visible in the table suggest that a simpler underlying model can “explain'’ this table. Only three types of cells (100\%; 50\%; 25-50-25) appear in the table. It is extremely unlikely that this strong “powers of two'’ pattern would appear by random chance from the unconstrained 18-parameter model. This clearly suggests a discrete model, specifically random sampling from a 2-choose-1 process. That is, each individual has two state variables $(x_1,x_2)$ that can each take either 0 or 1 as values. Each state variable is inherited from one of the parents (by random choice of one of the parent’s two state variables). Finally, we have a trivial rule for how these variables affect flower color: $x_1|x_2 \Leftrightarrow purple$ (else $white$), where the vertical bar $|$ means the logical OR operator. I will refer to this as the “haplotype model'’, because in genetics such state variables are called “haplotypes'’.

It’s important to consider the relationship between these two models. First, since they result in the same likelihood function (the probability of the specific types of observations discussed so far), they are observationally equivalent. This is an important point. Collecting more observations of the same type will not help us test one model vs. the other. We must therefore consider these models to form an equivalence class: because they are observationally indistinguishable by definition (at least for this type of observations), it makes no sense to make these models compete with each other in our Bayesian probability calculations (this is double-counting, counting the same model twice, which violates the requirement that probability be summed over disjoint sets). However, it’s important to note that other types of observations might be able to distinguish these models. Whereas our current observations only consider outward characteristics like flower color (broadly categorized by biologists as “phenotype'’, the external manifestations of genetic phenomena), direct observations of “genotype'’ (the actual genetic material) could distinguish these models. Specifically, such observations can provide direct information about the hidden genotype variables: e.g. are there really two copies of each genetic state variable?
Second, both models are “sufficient statistics'’ in the sense that they capture all the information that is in these observations. Specifically, adopting these models reduces the potential information of the observations to zero. Recall that in conventional statistical problems there can be multiple sufficient statistics for a given dataset, which can differ in how “efficiently'’ they condense the information. For the Mendelian genetics problem, the haplotype model captures the pattern of the observations with far fewer parameters than the Baum-Welch $3 \times 3$ contingency table. We can refer to the smallest possible sufficient statistic for the observations as a “minimal sufficient model'’ (in this case, the haplotype model).

It’s useful to consider explicitly what’s “better'’ about the haplotype model than our original $3 \times 3$ contingency table. Two related patterns emerge. First, the contingency table displays striking symmetry constraints (e.g. it is invariant to exchange of $x_1 \leftrightarrow x_2$; it consists solely of powers of two, etc.). Second, the haplotype model involves far fewer parameters. While the contingency table implies 18 distinct degrees of freedom (DOF), the haplotype model consists of only a few pieces of information: the number of state variables:2; the number of possible states for each variable: 2; the random choice of one state variable from parent to child; the logical OR operation to obtain phenotype. According to this model, the 18 apparent DOF are actually constrained to fixed relations with each other by these model constraints. These are just two different ways of saying the same thing: the haplotype model represents the discovery that a large {\em dimensional reduction} of the $3 \times 3$ contingency table is possible. From this point of view, obtaining the “minimal sufficient model'’ from a merely “sufficient model'’ is a matter of recognizing the symmetry group of its internal structure. Alternatively, one can generate competing minimal models by enumerating different symmetry groups, from simple to more complex, and seek the first one that matches well to the sufficient model. We will examine in detail such approaches to discovering dimensional reductions in data later.

However, as important as the haplotype model is to how geneticists think, it is actually not very important whether our robot “figures out'’ this minimal sufficient model at this stage, or simply uses the “sufficient model'’ provided by the Baum-Welch $3 \times 3$ contingency table. For the purposes of modeling
additional genetic problems and gaining insight into new levels of genetic organization, our robot can do just fine with the contingency table. The whole point of “sufficiency'’ is that it is observationally equivalent–indistinguishable by likelihood–and thus does just as good a job of modeling the data as the minimal sufficient model. Our robot can put off trying to solve the minimal sufficient model problem until it encounters some observations that can actually distinguish these two types of models (e.g. direct observations of genotype).

Textbook chapters 1-5

February 8th, 2007

I’m writing a textbook on Information Evolution. Or at least I thought I was — so far it mainly seems to be about the statistical inference side of “information”, as opposed to the “evolution” side. I suspect it will make more sense to make this focus on inference and methodology, and leave the science of how physical systems produce information for a later effort. If you have an interest in the basic issues I’m raising in the posts here, you may want to take a look at the first five draft chapters. That’s where the real meat is.

The General Information Metric Hypothesis

February 8th, 2007

Does there exist an information metric with truly general utility? If so, a scientist could use it to choose which experiment to do: the best experiment is that one that yields the largest amount of information about the scientist’s question of interest (or, over the long-term, the highest information rate per unit time / expense). Indeed, if the metric were truly general, the scientist could use it to decide which research question is “most interesting” (again, compute the expected information yield for the different research directions). Actually, if such an information metric existed, the “scientist” could just be a robot, because all that is required is the ability to calculate this metric for different possible experiments (observations). This wouldn’t be artificial intelligence in the traditional sense of that field, but instead just a big statistical number-crunching computation. In a way, scientific computing at its dullest.
There are lots of possible objections to this idea, which mainly come down to competing definitions of “information”. I’ll just try to outline these.

  • complexity vs. information: standard information theory metrics like Shannon entropy or Kolmogorov complexity are measures of complexity, which is clearly related to but not the same as “information”. The usual interpretation placed on this is that information is a reduction in complexity. The substitution paradox illustrates this question. Say we take the text of the Gettysberg Address and gradually randomize it (by swapping random pairs of characters in the text). Intuitively this operation reduces the information content. It does indeed increase the text’s complexity (as measured by entropy or Kolmogorov, indicating increasing randomness), consistent with the proposed inverse relationship between complexity and information. On the other hand, if we instead gradually replace individual characters of the text with a fixed character (e.g. zeros), this reduces both the complexity metrics and intuitive information content. This appears to disprove the “reduction in complexity” definition of information. Does this leave us with no usable definition of an information metric?
  • observable vs. hidden metrics of information: the Shannon and Kolmogorov metrics treat complexity as an observable property. That is, given a set of observations obs, the complexity is a uniquely determined value, just a function K(obs). Another way of saying this is that the complexity metrics are model-free; that is, they have no model component that can give different values of the metric under different alternative models. On the other hand, there are many reasons to think that information is a hidden parameter, i.e. the information content of a text is not the letters themselves (the observables), but its meaning. For example, this trivially resolves the substitution paradox: for a text T, modified text T’ and meaning `\mu`, `Pr(M|T’)` forms a Markov chain `T’ \rightarrow T \rightarrow \mu`, and the mutual information `I(T’;\mu)\rightarrow 0` for either shuffling or erasure of `T \rightarrow T’`. But most importantly, this point of view brings to bear the full machinery of statistical inference (i.e. Bayes Law and ways to compute it) on the problem of defining an information metric. In my view that is a Good Thing. But it does open the Pandora’s box of models, i.e. Bayes Law in principle means a summation over an infinite series of all possible models, and “reading the information content of a set of observations” turns into “discovering the model that best fits the available observations”. That (or rather the inference-information theory fusion part of it) will be the main topic of this blog. For example, a key problem is enumeration: can you obtain a generator function that enumerates models (terms of the infinite series) in descending order of their prior probabilities?
  • finite bounds: one simple but crucial test for a proposed information metric is that it must remain bounded (finite) for any number of observations N from a stationary distribution. i.e. `I(obs^N)<C` for `N \rightarrow \infty`, where `C` is some constant. For example, if we point the first gamma ray detector ever invented at the sun, and record for 10 days, we expect to learn something, but if we record its output for N times as long we don’t expect N times as much information content (assuming the sun’s emissions are a stationary distribution). As N grows, eventually the information content should saturate (i.e. we’re just seeing “the same old thing” over and over). This simple consideration is actually useful for rejecting some plausible information metrics. For example, say a set of observations are emitted from a normal distribution. We might propose to use the decrease in entropy of the posterior distributions for the sufficient statistics `Pr(\mu,\theta|obs)` as the information metric. As we collect initial observations, this entropy goes down, yielding information. The problem is that it keeps going down without bound (e.g. the variance for the posterior distribution of `\mu` goes down as `\frac{1}{N}`, and the posterior entropy heads towards `-\infty`). That would make the information metric increase without bound. One can try to finesse this by emphasizing that no detector has infinite resolution (so we have to separate systematic error from random (sampling) error), but still it indicates that there is a basic problem with this direction. On the other hand, if we define information as measuring our ability to predict observables, this problem goes away.
  • the No Free Lunch theorem: this theorem states that no optimization algorithm can perform better on average over all possible cost functions, than a completely random algorithm. This is sometimes interpreted as a rigorous demonstration that there is no such thing as a general method of problem solving, since over the set of all possible optimization problems (cost functions), NO such method could ever do better than random. (Indeed, its authors present it that way). However, it’s a long way from the very specific assumptions and results of the NFL theorem to the rather different conditions of real-world statistical inference. Does the NFL theorem actually apply to statistical inference problems and rule out the possibility of a general information metric? Personally, I don’t think so, as I’ll discuss in detail below.