Example: Modeling the Discovery of Mendelian Genetics
Let’s examine the principles for a general process of learning, via a scientific example: could we program a robot to make scientific discoveries directly from raw observations? As an example problem, let’s take the discovery of the basic principles of genetics by Gregor Mendel and subsequent researchers.
Preliminaries: Modeling Species
Before we can even think about asking our robot to discover genetics, we must consider how it would become “aware” of the basics of biology through its standard modeling process. This process has several steps.
- simple clustering: a good place to begin is the problem of detecting distinct clusters in raw observational data. Imagine that our robot is equipped with a variety of detectors that allow it to measure different characteristics like weight, height (and other dimensions), color, etc. for every “object” it encounters. Assuming an environment that is relatively sparse in different types of objects, these datawill form distinct peaks in the multidimensional parameter space. We can treat this as a completely conventional clustering problem. We can simply outline a procedure:
1) formulate an uninformative root model (e.g. a constant density over the whole region in which observations are possible). We can think of this as our “zero information” model.
2) identify the region of observation space with maximum potential information relative to our current model (initially, just the root model). This simply requires performing some type of smoothing function over the observation space, and computing relative entropy density versus our current model. Distinct peaks will be identified by regions of positive relative entropy density, surrounded by regions of negative relative entropy density. Thus peaks can be defined as regions above some relative entropy density threshold (near zero). We can identify the peak with maximum integrated potential information, and construct a model (e.g. a normal distribution) for the observations in that region. We set the prior probability `\lambda_j` of this model to maximize the total probability of the observations.
`\Pr(X_1 X_2…X_n,f_1,f_2…f_m)=(\prod_{i=1}^n{\sum_{j=0}^m{\lambda_j \Pr(X_i|f_j)}})(\prod_{j=0}^m{\Pr(f_j|f_0)})`
where `f_j` represents the `j`th model (`f_0` is the root model), and $\lambda_j$ represents the inferred weights of the models ($\sum_{j=0}^m{\lambda_j}=1$). Note that this expression treats the models as being emitted from the root model just as the observations are emitted from the models, i.e. $f_0 \rightarrow f_j \rightarrow X_i$. We will examine such multilevel modeling approaches in detail later. Finally, we accept the new model if the total probability increase (odds ratio for including the new model vs. omitting it) is above some chosen confidence threshold.
3) Simply repeat step 2, using the new density combining the previous model and the new model. Stop when we can no longer find any confident new models.
- Detection of Species: The clusters obtained via this procedure will correspond to distinct groups of objects. These might include some living organisms (e.g. fruitflies, tigers, pea plants) and some “dead” matter (e.g. grains of sand, pieces of driftwood, tin cans). We can think of these as species, each defined entirely by their population distributions.
- Detection of Individuals: What if we look repeatedly at a small region containing not very many objects? This procedure of “sampling with replacement” will resample the same item multiple times. Of course, resampling the same item will not yield identical observation values (e.g. a tiger’s weight might vary depending on whether he just ate; also the robot’s detectors themselves have variation due to noise). Thus each individual will itself be represented by a peak in the observation space, most likely with much narrower variance than that of the entire species population.
This poses a clear problem for a conventional clustering procedure. Which level of peaks should it detect? If it clusters the observations into individuals, then the species clustering is lost (the species is split into individuals), and vice versa. Clearly the right answer is to recognize the existence of both levels of clustering: i.e. that the raw observations are emitted by individuals, who are in turn emitted by the distribution representing their species. This is just another layer in our multilevel model: i.e. $f_0 \rightarrow f_j \rightarrow g_k \rightarrow X_i$ where $f_0$ is the root model, $f_j$ is a species model, $g_k$ is an individual model, and $X_i$ is a raw observation.
We lose our ability to detect individuals when their peaks overlap (e.g. if many similar individuals are observed at once). If our robot is stationed in the arctic and sees a similar looking wolf every day, how can it tell if it’s always the same individual, or whether it’s many different wolves (maybe all wolves look identical)? This forces us to collect a new type of observation, tracking individuals, e.g. by marking a given individual with a unique tag, or by following that one wolf continuously through time so we know we’re observing the same individual.
- The Model-Emits-Model Pattern: Both species and individual are models, inferred from observational data. However, we could argue that individuals act as observations for inferring species. In fact, this is how we actually study populations in most cases: we collect observations about individuals (often collecting multiple observations of one individual to get a good mean value), and then use the sample of individuals to infer the distribution of the species. It would be incorrect to use the raw observations directly to infer the species distribution; for example, we should not include multiple observations of the same individual (they can be treated as independent observations of the individual but not of the species). Indeed, if we think carefully about most of the parameters we consider to be “raw observations” (e.g. “sequence of the PCR product”; “weight of the tiger”), these are actually hidden variables inferred from even “rawer” detector outputs (e.g. chromatogram traces; movement of a pointer on a scale dial). Thus our “observations” are themselves models. From the point of view of multilevel modeling this is perfectly natural: everything is assumed to be emitted by a higher-level model, going back recursively to the root model.
- An obvious question from this point of view is how do individuals emit individuals? Here our robot can make a very interesting discovery: biological reproduction. Individuals give birth to new individuals, always of the same species. This introduces a new meaning for “emits” (our robot can actually watch the new individual emerging from the first individual): $\forall g_1,g_2:g_1 \rightarrow g_2 \implies \exists f: f \rightarrow g_1,f \rightarrow g_2$. The species-emits-individual relation can be thought of as “individual $g$ is an instance of species $f$” (membership of a distribution). By contrast, the individual-emits-individual relation is a “causal theory” of how individuals are actually created. If our robot is alert, it may notice that actually two individuals, male and female, are needed to produce a “child” individual: $g_1 \times g_2 \rightarrow g_3$.
- The Bio-object Model: This pattern itself becomes a model, in the following way. Because multiple species (fruitflies, tigers, pea plants etc.) all display this same pattern, while other “species” (sand, rocks, tin cans etc.) do not obey this model, our robot will become convinced that this model is a distinct cluster. This model gets added to its repertoire of models that can be used to emit new species instances. In other words, this model becomes a class, of which many species are member instances (analogous to “individual is a member of species”). We can refer to this model as the “Bio-object” model, since it captures precisely what defines living organisms as distinct from other types of “objects”.
How might we formulate this model? Focusing for the moment on asexual reproduction (i.e. an instance of species $x$ generates a new instance of species $x$), we can express this model as a trivial graph $x \rightarrow x$, i.e. a single node $x$, with a single outgoing edge pointing to itself. We can imagine this model emitting many “species models” via the following likelihood function:
`p(obs|model)=\begin{array}{l c}
1 & \text{if {\em obs} has self-edge} \\
& \\
0 & \text{if {\em obs} has non-self edge}
\end{array}`
Under this model, the probability of $wolf \rightarrow wolf$ is one, but the probability of $wolf \rightarrow mouse$ is zero. Thus this “bio-object model” emits many individual species models: e.g. $mouse \rightarrow mouse$, $pea \rightarrow pea$ etc.). The bio-object model becomes a new layer in the multi-level modeling: $f_0 \rightarrow b \rightarrow f_j \rightarrow g_k \rightarrow X_i$ where $f_0$ is the root model, $b$ is the bio-object model (consisting of the likelihood function given above), $f_j$ is a species model, $g_k$ is an individual model, and $X_i$ is a raw observation. Of course, there will also be a “dead-matter” model, which emits types of objects that do not follow the bio-object rule (e.g. sand, automobiles, etc.).
The Mutant Observation
In this context, let’s consider Mendel’s observation that some pea plants give white flowers (instead of the usual purple flowers; I’ll abbreviate these as Wh and Pu). Our current models provide two possible models of this observation:
- Wh is just part of random variation within the pea species: One possible model is to propose that Wh is part of the pea species, i.e. that the Wh observations were emitted from the standard “pea species model” Pu. There are two problems here:
- Wh and Pu are distinct peaks: If the Wh observations simply represented the tail of the Pu distribution, they could be considered to be part of the Pu distribution. (If the Wh observations fit within the density of the Pu tail, they would have zero potential information). However, the Wh observations form a separate peak that is many standard deviations distant from the Pu peak. They do not appear to be explainable as part of the Pu peak. There are two distinct peaks here, implying two different models.
- Wh is not randomly distributed: Even if we wanted to argue that Wh is emitted from the Pu peak as part of its random variation, we run into an immediate problem. Wh does not behave like random variation but instead occurs in a highly non-random pattern. That is, instead of each flower having a certain probability of randomly being white or purple, we observe non-random clustering by individual: all the flowers on a given individual plant are white, or all of them are purple. This strong segregation is more what we would expect if Wh and Pu were separate species.
2. “White” is a new species: If the white-flowered plants lie outside the multivariate distribution (height, weight, number of flowers, color, etc.) of the pea species, one obvious interpretation within our existing model is that Wh is a new species. Of course, it would then be expected to obey the “species model”, i.e. $x \rightarrow x$, only giving rise to Wh progeny. This predicts strong expectation information for tracking the progeny of Wh plants, and the progeny of Pu plants: we have high probability that this set of observations will enable us to “discover” a new species (add a new species model for the Wh species. Initial observations appear to validate this expectation (the progeny of Wh individuals are themselves Wh and the progeny of Pu individuals are themselves Pu). Thus Wh appears to be a new species.
The crossing experiment
The next experiment to validate the Wh species model is to attempt a cross between Wh and Pu: according to the species model, no progeny should result. At this point Mendel made a crucial observation: Wh and Pu individuals can be crossed, and yield purple progeny: $Wh \times Pu \rightarrow Pu$. This comes as a real shock, because it violates our fundamental model for bio-objects (i.e. the species propagation model).
- Higher-order Potential Information: We already have strong evidence that Wh is not part of the Pu species; now we have strong evidence that it is not a separate species. This apparent contradiction, analogous to a reductio ad absurdam step in a mathematical proof, suggests that our starting assumptions, i.e. the “bio-object model”, are wrong, or at a minimum clearly incomplete. All of this is detectable as potential information, but this time with a difference: instead of indicating extension of the “surface” models that directly emit observations (e.g. add a new species model), this potential information indicates that we need to extend our higher-order models. Our “bio-object model” is apparently no longer adequate to explain the behavior of bio-objects such as pea plants.
- Can the species model be rescued? Actually, there is one way that the species model could be rescued. In Mendel’s day it was well-known that in certain cases, similar species could be crossed, for example, $horse \times donkey \rightarrow mule$, but that the progeny were sterile (unable to reproduce). This is observed often enough in different species that it can be added as an extension of the species model: $x \times x^{\prime} \rightarrow y$, where $y$ has no outgoing edges (no progeny). This model can in turn emit specific cases like $horse \times donkey \rightarrow mule$. So an obvious question is whether the $Wh \times Pu \rightarrow Pu$ result can be explained by this model. This creates expectation information for testing to see if the “hybrid” progeny are sterile.
- The hybrid surprise: This leads to an even more shocking result. Not only can the hybrids reproduce as successfully as normal pea plants, they give a mix of $\frac{1}{4}Wh$ and $\frac{3}{4}Pu$ progeny. This strongly violates the species model, in two ways. First, it should not be possible to cross two species and get fertile offspring. Second, $Wh$ and $Pu$ are “turning into each other” with alarming frequency, in both directions. In the first cross, progeny of a $Wh$ parent come out $Pu$, and now some of their progeny turn out $Wh$!
From Population Model to Individual Causal Theory
This amazing result has some subtle consequences.
- The discovery of hidden variables: Evidently at least some of the hybrid plants must be different from $Pu$ plants, because they give such radically different progeny ($\frac{1}{4}Wh$ from the hybrid group, vs. none from the $Pu$ plants). Yet the individual plants of the two groups are themselves indistinguishable– they both have purple flowers. Just classifying the plants on the basis of the flower color is apparently no longer an adequate description. This suggests that we must switch from modeling the data by simply clustering the observations (e.g. “purple” vs. “white”) to infering a “hidden variable theory”: hidden variables that can explain the observations.
- From population models to individual states: The obvious implication is that we can no longer assume that all individuals of a given “population” are the same just because they are all purple. An obvious solution is that we can distinguish each individual’s “true type” by performing the self-cross progeny test. We have strong expectation information for doing this analysis, since we expect to be able to confidently determine the type of each individual, and we have substantial uncertainty of its true type before we do that experiment. For example, the population average result $hybrid \rightarrow \frac{1}{4}Wh + \frac{3}{4}Pu$ could have very different meanings depending on whether all hybrid individuals give this same result, or whether there might be different types of hybrid individuals that have different progeny patterns. We can easily do this experiment for individual hybrid plants, and there is high expectation information for being able to interpret the results of such experiments. The experiment result is that we find each hybrid individual yields the same proportion of $\frac{1}{4}Wh + \frac{3}{4}Pu$ progeny. This suggests that “hybrid” represents a single individual state; we therefore can give it the symbol $Hy$.
A slightly more subtle variant on the same “individual state” question is to ask whether the $\frac{3}{4}Pu$ are really all $Pu$; after all, some of them might be $Hy$. We can resolve this by testing each individual to see what progeny it gives. This yields another interesting experimental result: the hybrid cross actually gives a mix of progeny: $Wh \times Pu \rightarrow Hy \rightarrow \frac{1}{4}Wh + \frac{1}{2}Hy + \frac{1}{4}Pu$
- This process has taken us decisively away from the “random population” model, in which all the observations were drawn i.i.d. from a single distribution, to a “hidden variable” model, where each individual carries hidden variables that are causal i.e. they determine the outcomes of its matings. For example, the very existence of the $Hy$ category corresponds to a hidden variable; $Hy$ individuals have the same purple flowers as $Pu$ individuals, but are demonstrably different by the self-cross test. Our models have shifted from working at the level of populations (”species”) to working at the level of individuals, and from i.i.d. to relying on hidden variables attached to each individual.
- This immediately gives us strong expectation information to re-do some of the above experiments at the individual (rather than the population) level, i.e. we must track progeny by individuals rather than by populations. This generates large amounts of new information about the hidden states of individuals. We can refer to this kind of extrapolation (using a newly established principle to collect a new kind of information about many individual cases) as “surveying”.
- For example, one obvious experiment is to cross $Wh \times Hy$, or $Pu \times Hy$. This fills in the last four cells of the $3 \times 3$ contingency table representing all possible crosses (see table). It’s important to understand that this table can be obtained directly from simple Baum-Welch training from the raw observational data.
| Parent | $Wh$ | $Hy$ | $Pu$ |
| $Wh$ | $Wh$ | $\frac{1}{2}Wh+\frac{1}{2} Hy$ | $Hy$ |
| $Hy$ | $\frac{1}{2}Wh+\frac{1}{2} Hy$ | $\frac{1}{4}Wh+\frac{1}{2}Hy+\frac{1}{4}Pu$ | $\frac{1}{2}Hy+\frac{1}{2}Pu$ |
| $Pu$ | $Hy$ | $\frac{1}{2}Hy+\frac{1}{2}Pu$ | $Pu$ |
Sufficient Models vs. Minimal Sufficient Models
This contingency table, consisting of $9\times 2 = 18$ degrees of freedom, is derived directly from observations via Baum-Welch training. It is itself a model, but the strong regularities visible in the table suggest that a simpler underlying model can “explain” this table. Only three types of cells (100\%; 50\%; 25-50-25) appear in the table. It is extremely unlikely that this strong “powers of two” pattern would appear by random chance from the unconstrained 18-parameter model. This clearly suggests a discrete model, specifically random sampling from a 2-choose-1 process. That is, each individual has two state variables $(x_1,x_2)$ that can each take either 0 or 1 as values. Each state variable is inherited from one of the parents (by random choice of one of the parent’s two state variables). Finally, we have a trivial rule for how these variables affect flower color: $x_1|x_2 \Leftrightarrow purple$ (else $white$), where the vertical bar $|$ means the logical OR operator. I will refer to this as the “haplotype model”, because in genetics such state variables are called “haplotypes”.
It’s important to consider the relationship between these two models. First, since they result in the same likelihood function (the probability of the specific types of observations discussed so far), they are observationally equivalent. This is an important point. Collecting more observations of the same type will not help us test one model vs. the other. We must therefore consider these models to form an equivalence class: because they are observationally indistinguishable by definition (at least for this type of observations), it makes no sense to make these models compete with each other in our Bayesian probability calculations (this is double-counting, counting the same model twice, which violates the requirement that probability be summed over disjoint sets). However, it’s important to note that other types of observations might be able to distinguish these models. Whereas our current observations only consider outward characteristics like flower color (broadly categorized by biologists as “phenotype”, the external manifestations of genetic phenomena), direct observations of “genotype” (the actual genetic material) could distinguish these models. Specifically, such observations can provide direct information about the hidden genotype variables: e.g. are there really two copies of each genetic state variable?
Second, both models are “sufficient statistics” in the sense that they capture all the information that is in these observations. Specifically, adopting these models reduces the potential information of the observations to zero. Recall that in conventional statistical problems there can be multiple sufficient statistics for a given dataset, which can differ in how “efficiently” they condense the information. For the Mendelian genetics problem, the haplotype model captures the pattern of the observations with far fewer parameters than the Baum-Welch $3 \times 3$ contingency table. We can refer to the smallest possible sufficient statistic for the observations as a “minimal sufficient model” (in this case, the haplotype model).
It’s useful to consider explicitly what’s “better” about the haplotype model than our original $3 \times 3$ contingency table. Two related patterns emerge. First, the contingency table displays striking symmetry constraints (e.g. it is invariant to exchange of $x_1 \leftrightarrow x_2$; it consists solely of powers of two, etc.). Second, the haplotype model involves far fewer parameters. While the contingency table implies 18 distinct degrees of freedom (DOF), the haplotype model consists of only a few pieces of information: the number of state variables:2; the number of possible states for each variable: 2; the random choice of one state variable from parent to child; the logical OR operation to obtain phenotype. According to this model, the 18 apparent DOF are actually constrained to fixed relations with each other by these model constraints. These are just two different ways of saying the same thing: the haplotype model represents the discovery that a large {\em dimensional reduction} of the $3 \times 3$ contingency table is possible. From this point of view, obtaining the “minimal sufficient model” from a merely “sufficient model” is a matter of recognizing the symmetry group of its internal structure. Alternatively, one can generate competing minimal models by enumerating different symmetry groups, from simple to more complex, and seek the first one that matches well to the sufficient model. We will examine in detail such approaches to discovering dimensional reductions in data later.
However, as important as the haplotype model is to how geneticists think, it is actually not very important whether our robot “figures out” this minimal sufficient model at this stage, or simply uses the “sufficient model” provided by the Baum-Welch $3 \times 3$ contingency table. For the purposes of modeling
additional genetic problems and gaining insight into new levels of genetic organization, our robot can do just fine with the contingency table. The whole point of “sufficiency” is that it is observationally equivalent–indistinguishable by likelihood–and thus does just as good a job of modeling the data as the minimal sufficient model. Our robot can put off trying to solve the minimal sufficient model problem until it encounters some observations that can actually distinguish these two types of models (e.g. direct observations of genotype).