Organization of Biological Networks Workshop Talk
March 9th, 2008This week-long workshop was a highlight of my time at IMA. I really enjoyed the diversity and excitement of this new field. Here’s a video of my talk on HIV drug resistance evolution.
This week-long workshop was a highlight of my time at IMA. I really enjoyed the diversity and excitement of this new field. Here’s a video of my talk on HIV drug resistance evolution.
Yesterday I talked about applications of potential information to experiment planning, using the example of a robot seeking to discover the principles of genetics from the initial observation of a “mutant” pea plant with white flowers. You can listen to the audio (right click on the audio link, and Save Link As, then listen to the downloaded file using QuickTime player or Real player). I also captured most of the material I wrote on the whiteboard. Some relevant background material (and detailed exposition of the RoboMendel example) is also available.
Well, I failed to record my seminar audio, but here are some relevant notes for material discussed in the third seminar. This time we discussed the application of information metrics to experiment planning, rather than just model selection. One metric that I emphasized this time is the notion of potential information, which provides a signal for whether the current model needs to be expanded because its fit to the observations is inadequate. The attached material discusses some concrete examples of potential information, for example, for experiment planning.
I gave a talk today on my lab’s work on HIV drug resistance evolution and conditional selection pressure “networks”. The slides and audio are available here (to download the audio right-click the link and select Save Link As…). To listen to the audio you can use the QuickTime player or Real player. For our publications on this topic, see our publications page.
Here are my slides for my second talk at IMA on Feb. 14. I tried to introduce some problems with typical information metrics as they apply to statistical inference problems. Then I describe empirical information, my preferred information metric for statistical inference. The slides are available as a PDF, and the audio of the talk is also available — you can use either RealPlayer or the Quicktime player to listen to this. To download the audio, right-click on this link and choose Save Link As…
I’ve also posted some background material cut from different chapters of my draft textbook as a PDF.
Here are a couple of items relevant to my Feb. 7 intro session at IMA:
I’ve been hacking a bit with Silverlight, Microsoft’s environment for running dynamic languages like Python and Ruby directly within a web browser like Firefox or IE. It seems to work quite well, and it’s easy to get Python code up and running in Silverlight. Indeed, it’s been surprisingly easy to get quite significant chunks of Python running in Silverlight — specifically, pygr, including large portions of code originally written for Pyrex (a mixed C / Python language environment). First I’ll describe my impressions of Silverlight and its implications for Python, then what I’ve accomplished with it.
I think Silverlight potentially has a lot of importance for Python for several reasons. Its basic function is fairly amazing: it lets you run Python inside your web browser with complete control over the web page contents and interactivity. Those of us who feel that the web browser is the logical place for many kinds of interactive documents, visualization and rapid prototyping, but want to use a real language (instead of Javascript!), finally have a solution. This is exciting! It is thrilling to be creating interactive web pages whose interactivity is all done by Python code — forget all that JavaScript!
But Pythonistas should also pay attention to Silverlight as a potential threat: this is Microsoft’s attempt to hijack Python’s many virtues and put them to work for Microsoft. For example, Silverlight runs IronPython, and as such lacks most of the Python standard libraries that Pythonistas know and rely on. In their place it gives you access to .NET. Similarly, it redirects you away from open standards like SVG and into Microsoft’s WPF (Windows Presentation Framework?). The concern here is that huge numbers of .Net and web developers will be taught a Python that is Not Python, i.e. they will be taught to write code that can’t run in standard Python, or indeed anywhere outside of Microsoft’s IronPython. The good news, I think, is that Python is too dynamic to box in this way; it is very easy to port Python standard libraries to run in Silverlight. I think Pythonistas should push very hard to provide as complete as possible a standard library for IronPython and Silverlight, so that developers in those environments are not forced over to the Dark Side.
Another interesting possibility is that Silverlight is the “restricted execution environment” Python has always lacked. Python code running in Silverlight supposedly has no access to the file system or system calls, and no way to read or alter user files. Since it runs in the web browser, and executes code sent by potentially malicious websites, logically it would have to be secure in this sense, or it will be worse than useless for Microsoft’s purposes. Taking a cynical view, if Silverlight takes off, hackers are going to be attacking it from every angle, and Microsoft will be forced to patch up the holes. So over time, Silverlight could evolve into a safe “sandbox” where you can run code that you want to confine to very limited privileges.
OK, so what have I done with Silverlight?
import xmlrpclib
server = xmlrpclib.ServerProxy('http://localhost:8000')
result = server.some_method(*args)I will post a tar of the source code tomorrow, and update this page to link to it. But now to bed.
Let’s examine the principles for a general process of learning, via a scientific example: could we program a robot to make scientific discoveries directly from raw observations? As an example problem, let’s take the discovery of the basic principles of genetics by Gregor Mendel and subsequent researchers.
Preliminaries: Modeling Species
Before we can even think about asking our robot to discover genetics, we must consider how it would become “aware'’ of the basics of biology through its standard modeling process. This process has several steps.
1) formulate an uninformative root model (e.g. a constant density over the whole region in which observations are possible). We can think of this as our “zero information'’ model.
2) identify the region of observation space with maximum potential information relative to our current model (initially, just the root model). This simply requires performing some type of smoothing function over the observation space, and computing relative entropy density versus our current model. Distinct peaks will be identified by regions of positive relative entropy density, surrounded by regions of negative relative entropy density. Thus peaks can be defined as regions above some relative entropy density threshold (near zero). We can identify the peak with maximum integrated potential information, and construct a model (e.g. a normal distribution) for the observations in that region. We set the prior probability `\lambda_j` of this model to maximize the total probability of the observations.
`\Pr(X_1 X_2…X_n,f_1,f_2…f_m)=(\prod_{i=1}^n{\sum_{j=0}^m{\lambda_j \Pr(X_i|f_j)}})(\prod_{j=0}^m{\Pr(f_j|f_0)})`
where `f_j` represents the `j`th model (`f_0` is the root model), and $\lambda_j$ represents the inferred weights of the models ($\sum_{j=0}^m{\lambda_j}=1$). Note that this expression treats the models as being emitted from the root model just as the observations are emitted from the models, i.e. $f_0 \rightarrow f_j \rightarrow X_i$. We will examine such multilevel modeling approaches in detail later. Finally, we accept the new model if the total probability increase (odds ratio for including the new model vs. omitting it) is above some chosen confidence threshold.
3) Simply repeat step 2, using the new density combining the previous model and the new model. Stop when we can no longer find any confident new models.
This poses a clear problem for a conventional clustering procedure. Which level of peaks should it detect? If it clusters the observations into individuals, then the species clustering is lost (the species is split into individuals), and vice versa. Clearly the right answer is to recognize the existence of both levels of clustering: i.e. that the raw observations are emitted by individuals, who are in turn emitted by the distribution representing their species. This is just another layer in our multilevel model: i.e. $f_0 \rightarrow f_j \rightarrow g_k \rightarrow X_i$ where $f_0$ is the root model, $f_j$ is a species model, $g_k$ is an individual model, and $X_i$ is a raw observation.
We lose our ability to detect individuals when their peaks overlap (e.g. if many similar individuals are observed at once). If our robot is stationed in the arctic and sees a similar looking wolf every day, how can it tell if it’s always the same individual, or whether it’s many different wolves (maybe all wolves look identical)? This forces us to collect a new type of observation, tracking individuals, e.g. by marking a given individual with a unique tag, or by following that one wolf continuously through time so we know we’re observing the same individual.
How might we formulate this model? Focusing for the moment on asexual reproduction (i.e. an instance of species $x$ generates a new instance of species $x$), we can express this model as a trivial graph $x \rightarrow x$, i.e. a single node $x$, with a single outgoing edge pointing to itself. We can imagine this model emitting many “species models'’ via the following likelihood function:
`p(obs|model)=\begin{array}{l c}
1 & \text{if {\em obs} has self-edge} \\
& \\
0 & \text{if {\em obs} has non-self edge}
\end{array}`
Under this model, the probability of $wolf \rightarrow wolf$ is one, but the probability of $wolf \rightarrow mouse$ is zero. Thus this “bio-object model'’ emits many individual species models: e.g. $mouse \rightarrow mouse$, $pea \rightarrow pea$ etc.). The bio-object model becomes a new layer in the multi-level modeling: $f_0 \rightarrow b \rightarrow f_j \rightarrow g_k \rightarrow X_i$ where $f_0$ is the root model, $b$ is the bio-object model (consisting of the likelihood function given above), $f_j$ is a species model, $g_k$ is an individual model, and $X_i$ is a raw observation. Of course, there will also be a “dead-matter'’ model, which emits types of objects that do not follow the bio-object rule (e.g. sand, automobiles, etc.).
The Mutant Observation
In this context, let’s consider Mendel’s observation that some pea plants give white flowers (instead of the usual purple flowers; I’ll abbreviate these as Wh and Pu). Our current models provide two possible models of this observation:
2. “White'’ is a new species: If the white-flowered plants lie outside the multivariate distribution (height, weight, number of flowers, color, etc.) of the pea species, one obvious interpretation within our existing model is that Wh is a new species. Of course, it would then be expected to obey the “species model'’, i.e. $x \rightarrow x$, only giving rise to Wh progeny. This predicts strong expectation information for tracking the progeny of Wh plants, and the progeny of Pu plants: we have high probability that this set of observations will enable us to “discover'’ a new species (add a new species model for the Wh species. Initial observations appear to validate this expectation (the progeny of Wh individuals are themselves Wh and the progeny of Pu individuals are themselves Pu). Thus Wh appears to be a new species.
The crossing experiment
The next experiment to validate the Wh species model is to attempt a cross between Wh and Pu: according to the species model, no progeny should result. At this point Mendel made a crucial observation: Wh and Pu individuals can be crossed, and yield purple progeny: $Wh \times Pu \rightarrow Pu$. This comes as a real shock, because it violates our fundamental model for bio-objects (i.e. the species propagation model).
From Population Model to Individual Causal Theory
This amazing result has some subtle consequences.
A slightly more subtle variant on the same “individual state'’ question is to ask whether the $\frac{3}{4}Pu$ are really all $Pu$; after all, some of them might be $Hy$. We can resolve this by testing each individual to see what progeny it gives. This yields another interesting experimental result: the hybrid cross actually gives a mix of progeny: $Wh \times Pu \rightarrow Hy \rightarrow \frac{1}{4}Wh + \frac{1}{2}Hy + \frac{1}{4}Pu$
| Parent | $Wh$ | $Hy$ | $Pu$ |
| $Wh$ | $Wh$ | $\frac{1}{2}Wh+\frac{1}{2} Hy$ | $Hy$ |
| $Hy$ | $\frac{1}{2}Wh+\frac{1}{2} Hy$ | $\frac{1}{4}Wh+\frac{1}{2}Hy+\frac{1}{4}Pu$ | $\frac{1}{2}Hy+\frac{1}{2}Pu$ |
| $Pu$ | $Hy$ | $\frac{1}{2}Hy+\frac{1}{2}Pu$ | $Pu$ |
Sufficient Models vs. Minimal Sufficient Models
This contingency table, consisting of $9\times 2 = 18$ degrees of freedom, is derived directly from observations via Baum-Welch training. It is itself a model, but the strong regularities visible in the table suggest that a simpler underlying model can “explain'’ this table. Only three types of cells (100\%; 50\%; 25-50-25) appear in the table. It is extremely unlikely that this strong “powers of two'’ pattern would appear by random chance from the unconstrained 18-parameter model. This clearly suggests a discrete model, specifically random sampling from a 2-choose-1 process. That is, each individual has two state variables $(x_1,x_2)$ that can each take either 0 or 1 as values. Each state variable is inherited from one of the parents (by random choice of one of the parent’s two state variables). Finally, we have a trivial rule for how these variables affect flower color: $x_1|x_2 \Leftrightarrow purple$ (else $white$), where the vertical bar $|$ means the logical OR operator. I will refer to this as the “haplotype model'’, because in genetics such state variables are called “haplotypes'’.
It’s important to consider the relationship between these two models. First, since they result in the same likelihood function (the probability of the specific types of observations discussed so far), they are observationally equivalent. This is an important point. Collecting more observations of the same type will not help us test one model vs. the other. We must therefore consider these models to form an equivalence class: because they are observationally indistinguishable by definition (at least for this type of observations), it makes no sense to make these models compete with each other in our Bayesian probability calculations (this is double-counting, counting the same model twice, which violates the requirement that probability be summed over disjoint sets). However, it’s important to note that other types of observations might be able to distinguish these models. Whereas our current observations only consider outward characteristics like flower color (broadly categorized by biologists as “phenotype'’, the external manifestations of genetic phenomena), direct observations of “genotype'’ (the actual genetic material) could distinguish these models. Specifically, such observations can provide direct information about the hidden genotype variables: e.g. are there really two copies of each genetic state variable?
Second, both models are “sufficient statistics'’ in the sense that they capture all the information that is in these observations. Specifically, adopting these models reduces the potential information of the observations to zero. Recall that in conventional statistical problems there can be multiple sufficient statistics for a given dataset, which can differ in how “efficiently'’ they condense the information. For the Mendelian genetics problem, the haplotype model captures the pattern of the observations with far fewer parameters than the Baum-Welch $3 \times 3$ contingency table. We can refer to the smallest possible sufficient statistic for the observations as a “minimal sufficient model'’ (in this case, the haplotype model).
It’s useful to consider explicitly what’s “better'’ about the haplotype model than our original $3 \times 3$ contingency table. Two related patterns emerge. First, the contingency table displays striking symmetry constraints (e.g. it is invariant to exchange of $x_1 \leftrightarrow x_2$; it consists solely of powers of two, etc.). Second, the haplotype model involves far fewer parameters. While the contingency table implies 18 distinct degrees of freedom (DOF), the haplotype model consists of only a few pieces of information: the number of state variables:2; the number of possible states for each variable: 2; the random choice of one state variable from parent to child; the logical OR operation to obtain phenotype. According to this model, the 18 apparent DOF are actually constrained to fixed relations with each other by these model constraints. These are just two different ways of saying the same thing: the haplotype model represents the discovery that a large {\em dimensional reduction} of the $3 \times 3$ contingency table is possible. From this point of view, obtaining the “minimal sufficient model'’ from a merely “sufficient model'’ is a matter of recognizing the symmetry group of its internal structure. Alternatively, one can generate competing minimal models by enumerating different symmetry groups, from simple to more complex, and seek the first one that matches well to the sufficient model. We will examine in detail such approaches to discovering dimensional reductions in data later.
However, as important as the haplotype model is to how geneticists think, it is actually not very important whether our robot “figures out'’ this minimal sufficient model at this stage, or simply uses the “sufficient model'’ provided by the Baum-Welch $3 \times 3$ contingency table. For the purposes of modeling
additional genetic problems and gaining insight into new levels of genetic organization, our robot can do just fine with the contingency table. The whole point of “sufficiency'’ is that it is observationally equivalent–indistinguishable by likelihood–and thus does just as good a job of modeling the data as the minimal sufficient model. Our robot can put off trying to solve the minimal sufficient model problem until it encounters some observations that can actually distinguish these two types of models (e.g. direct observations of genotype).
I’m writing a textbook on Information Evolution. Or at least I thought I was — so far it mainly seems to be about the statistical inference side of “information”, as opposed to the “evolution” side. I suspect it will make more sense to make this focus on inference and methodology, and leave the science of how physical systems produce information for a later effort. If you have an interest in the basic issues I’m raising in the posts here, you may want to take a look at the first five draft chapters. That’s where the real meat is.
Does there exist an information metric with truly general utility? If so, a scientist could use it to choose which experiment to do: the best experiment is that one that yields the largest amount of information about the scientist’s question of interest (or, over the long-term, the highest information rate per unit time / expense). Indeed, if the metric were truly general, the scientist could use it to decide which research question is “most interesting” (again, compute the expected information yield for the different research directions). Actually, if such an information metric existed, the “scientist” could just be a robot, because all that is required is the ability to calculate this metric for different possible experiments (observations). This wouldn’t be artificial intelligence in the traditional sense of that field, but instead just a big statistical number-crunching computation. In a way, scientific computing at its dullest.
There are lots of possible objections to this idea, which mainly come down to competing definitions of “information”. I’ll just try to outline these.