Info-Evo Journal Club meeting Wed. Oct. 19 3:30 pm

October 17th, 2011

We’ll be having our first Info-Evo journal club meeting on Wednesday, Oct. 19, at 3:30 pm, in the third floor conference room of Boyer Hall. This will be an organizational meeting to discuss which topics we want to cover, among other things. Please come if you are interested.

Open Peer Review by a Selected-Papers Network

July 2nd, 2011

Abstract

A Selected-Papers (SP) Network is a network in which researchers who read, write and review articles subscribe to each other based on common interests. Instead of reviewing a manuscript in secret for the Editor of a journal, each reviewer simply publishes his review (typically of a paper he wishes to recommend) on the SP network, which automatically forwards it to his subscribers. I present a three-phase plan for building a basic SP network, discovering and measuring the detailed structure of research communities, and transforming the SP network itself into an effective publisher of research articles in areas that are not well-supported by existing journals. I show how the SP network provides a new way of measuring impact, catalyzes the emergence of new subfields, and accelerates discovery in existing fields, by providing each reader a fine-grained filter for high-impact.

New “Open Science” Resources Page

May 3rd, 2011

I’m working on a variety of projects in service of the open sourcing of science, by which I simply mean enabling everyone in the scientific community to use their own tools and channels (under their control) for publishing, sharing data, teaching, learning, and so on. I’ve added a page that collects my various efforts.

General information metrics for automated experiment planning talk video available

May 3rd, 2011

I just posted my talk on empirical information metrics for experiment planning. This talk describes empirical information metrics for measuring the “information value” of an experiment either before you actually do it (experiment planning), or after (measuring how much information is contained in the empirical data, for improving our total prediction power). In addition to outlining the basic theory, it presents some example applications (e.g. the information value of including a control experiment), and a real-world application to the computational design of a new genomics experiment, “phenotype sequencing”, for identifying the genetic causes of a phenotype directly from sequencing of multiple independent mutants. It was presented in the UCLA Chemistry & Biochemistry Department faculty luncheon series, May 2, 2011.

Phenotype Sequencing Software and Documentation Released

February 20th, 2011

We’ve released our phenoseq software package for analyzing and simulating phenotype sequencing experiments. It can score genes as likely causes of a phenotype, based on sequencing data. It can also simulate possible phenotype sequencing experiment designs, to assess their prospects for success. Phenoseq is implemented in Python, and uses numpy / scipy for fast calculations. For example, it can simulate about 600,000 mutant genomes per second on my early-2008 MacBook Pro.

The code is available at GitHub, and the docs are available here.

Phenotype Sequencing Video Online

February 18th, 2011

If you’d rather hear about our new phenotype sequencing method than read about it, I’ve posted a video of my phenotype sequencing talk here. Enjoy!

Phenotype Sequencing Paper Published

February 18th, 2011

Our phenotype sequencing paper has been published by PLoS ONE. This paper presents a new general approach that can identify the genetic causes of a phenotype directly from sequencing of independent mutants, at minimal cost. Follow the link for the full text of the paper; here’s the abstract.

Random mutagenesis and phenotype screening provide a powerful method for dissecting microbial functions, but their results can be laborious to analyze experimentally. Each mutant strain may contain 50 – 100 random mutations, necessitating extensive functional experiments to determine which one causes the selected phenotype. To solve this problem, we propose a "Phenotype Sequencing" approach in which genes causing the phenotype can be identified directly from sequencing of multiple independent mutants. We developed a new computational analysis method showing that (1) causal genes can be identified with high probability from even a modest number of mutant genomes; (2) costs can be cut many-fold compared with a conventional genome sequencing approach via an optimized strategy of library-pooling (multiple strains per library) and tag-pooling (multiple tagged libraries per sequencing lane). We have performed extensive validation experiments on a set of E. coli mutants with increased isobutanol biofuel tolerance. We generated a range of sequencing experiments varying from 3 to 32 mutant strains, with pooling on 1 to 3 sequencing lanes. Our statistical analysis of these data (4099 mutations from 32 mutant genomes) successfully identified 3 genes (acrB, marC, acrA) that have been independently validated as causing this experimental phenotype. It must be emphasized that our approach reduces mutant sequencing costs enormously. Whereas a conventional genome sequencing experiment would have cost $7,200 in reagents alone, our Phenotype Sequencing design yielded the same information value for only $1,200. In fact, our smallest experiments reliably identified acrB and marC at a cost of only $110 – $340.
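
To make the core idea concrete, here is a toy sketch (not part of the abstract, and not the phenoseq scoring method or its API). It assumes a hypothetical genome of 4,000 equally sized genes, 30 independent mutant strains with 50 random background mutations each, and two made-up causal gene indices. It simply counts how many independent strains carry a mutation in each gene; genes hit in many independent strains stand out from the random background.

```python
import random
from collections import Counter

def simulate_strain_counts(n_genes, n_strains, mutations_per_strain,
                           causal_genes, seed=0):
    """Count, for each gene, how many independent strains carry a mutation in it."""
    rng = random.Random(seed)
    counts = Counter()
    for strain in range(n_strains):
        # Background: random mutations scattered uniformly over all genes
        # (a toy assumption; real genes differ in size and mutability).
        hit = {rng.randrange(n_genes) for _ in range(mutations_per_strain)}
        # Selection: every strain that passed the phenotype screen carries a
        # mutation in one causal gene. Assigned round-robin here only to keep
        # this toy example deterministic.
        hit.add(causal_genes[strain % len(causal_genes)])
        counts.update(hit)  # each strain counts at most once per gene
    return counts

causal = [17, 2042]  # hypothetical gene indices, for illustration only
counts = simulate_strain_counts(n_genes=4000, n_strains=30,
                                mutations_per_strain=50, causal_genes=causal)

# Rank genes by the number of independent strains in which they are mutated.
top = [gene for gene, _ in counts.most_common(len(causal))]
print("top genes:", sorted(top))
print("strains hitting each causal gene:", [counts[g] for g in causal])
```

A real analysis would replace the raw counts with a statistical score that accounts for gene length and the expected background mutation rate.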

Empirical information metrics paper published

February 2nd, 2011

It is common to measure the information value of a model as its average prediction power for some observable variable of interest. Then the absolute goodness-of-fit of a statistical model to a set of observations can be formulated as the total remaining information obtainable by the set of all models that we have not yet computed (one of which might fit the observations much better than our current model). In a paper just published in Information I define this metric as the potential information, and show that it can be estimated directly from the observations, without actually computing any of the remaining models.

This addresses a simple question in Bayesian inference: how do we know when we’re done? Bayesian inference is widely used in many disciplines, because it provides a general framework for evaluating the strength of evidence for a list of competing theories \Psi_i, given a set of experimental observations obs. If all problems could be solved by computing a short list of possible models (theories), this would be a good general strategy. In real-world scientific inference, however, we cannot a priori assume that the possibilities can be limited to a fixed list of models. So in practice we face a set of all possible models that is effectively infinite (or at least unmanageably large), of which we only calculate a small subset of terms. This raises the unsettling possibility that the correct model \Omega may not even be included in the subset of terms that we calculated.

Specifically, the probability of a model \Psi_i given a set of observations obs is calculated via Bayes’ Law:

p(\Psi_i|obs) = \frac{p(obs|\Psi_i)p(\Psi_i)}{p(obs)}

where the denominator p(obs) is calculated via the expansion p(obs)=\sum_i{p(obs|\Psi_i)p(\Psi_i)}. If the set of all possible models \Psi_i is infinite, we will only be able to calculate this sum for a subset of terms \Psi_1 … \Psi_n. This underestimates the total sum p(obs) and therefore overestimates the probability of any model p(\Psi_i|obs), perhaps grossly. The real question is whether the correct model \Omega was included in the calculated terms \Psi_1 … \Psi_n or not. Since by definition \Omega maximizes p(obs|\Psi), if included it may dominate the sum, and therefore our calculated probabilities may be reasonably accurate (in which case we are “done”). But if not, then they may be very inaccurate, and we would need to calculate more terms of the model series in the hopes of finding \Omega. So how do we know whether we’re done?
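
The effect of truncating the sum can be seen in a minimal numerical sketch. The likelihoods and priors below are made-up numbers chosen for illustration (they are not from the paper), and the function name is hypothetical. The third model plays the role of the well-fitting model that a truncated calculation might never reach.

```python
def bayes_posteriors(likelihoods, priors):
    """Apply Bayes' Law over an explicit list of models.

    Returns (p_obs, posteriors), where p_obs is the sum of
    p(obs|Psi_i) * p(Psi_i) over only the models that were supplied.
    """
    p_obs = sum(l * p for l, p in zip(likelihoods, priors))
    posteriors = [l * p / p_obs for l, p in zip(likelihoods, priors)]
    return p_obs, posteriors

# Hypothetical values of p(obs|Psi_i) and p(Psi_i) for four models.
likelihoods = [0.02, 0.05, 0.60, 0.01]  # the third model fits far better
priors = [0.25, 0.25, 0.25, 0.25]

# Full sum over all four models.
full_p_obs, full_post = bayes_posteriors(likelihoods, priors)

# Truncated sum: only the first two models were ever calculated.
trunc_p_obs, trunc_post = bayes_posteriors(likelihoods[:2], priors[:2])

print("p(obs): full = %.4f, truncated = %.4f" % (full_p_obs, trunc_p_obs))
for i in range(2):
    print("model %d posterior: full = %.3f, truncated = %.3f"
          % (i + 1, full_post[i], trunc_post[i]))
```

Because the truncated sum omits the best-fitting model, the truncated p(obs) is much smaller than the full one, and the two calculated models look far more probable than they really are. Nothing in the truncated calculation itself signals that a term is missing: its posteriors still sum to one.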

Unfortunately, Bayes’ Law does not answer this question. Intuitively, if the calculated subset of models is a poor fit to the observations, this will be reflected in a very low value for the calculated probability of the observations p(obs) — much smaller than “it should be”. So how large should p(obs) be? Again, Bayes’ Law does not answer this.

This question is directly relevant to understanding the scientific method mathematically, because it is related to Popper’s criterion of falsifiability, namely that a scientific theory is only useful if it makes predictions that could be shown to be wrong by experiments. Translated into Bayesian terms, this means showing that the fit of the calculated model terms to the experimental observations is not “good enough” — precisely the capability that Bayes’ Law lacks.

The potential information metric solves this problem by measuring the maximum amount of new information obtainable by computing all the remaining terms of the infinite model set. Its most interesting property is that we can measure it without actually computing any more terms of the infinite model set. For details on the metric and its relations with traditional information theory, see the paper.

Repair Your Frozen iPad / iPhone / iPod Touch Via SSH

September 19th, 2010

Sometimes the jailbreak giveth and the jailbreak taketh away. Particularly for the iPad, you have to be really careful which Cydia packages you install, because some packages developed for the iPhone will not only not work, but they can brick your device. For an example of how I got into trouble with this, see my previous post.

Fortunately, being jailbroken also gives you tools for solving your problems. In my case, although my iPad seemed completely frozen at the Apple logo, I noticed that iTunes would sync with it apparently OK, implying that the iPad OS was running just fine underneath that frozen graphical interface. So I tried using SSH from my laptop to access the iPad. It worked! Once I had a command line from the iPad, it was easy to repair the problem by just uninstalling the package(s) that caused the problem. Hopefully this may help others who run into similar problems.


Backgrounder & MobileSubstrate Will Brick Your iPad

September 12th, 2010

I just hit a very nasty iPad software installation ambush: installing Backgrounder (which in turn installs MobileSubstrate) will make your iPad hang forever at the Apple logo; once it gets in this state, you can’t even turn it off!
