## Chemistry 100 Genomics & Bioinformatics Course for Life Science undergrads

June 27th, 2012

This course was first offered Spring 2012, and I want to invite UCLA life science students to consider taking it next year in Spring quarter (2013):

Whether your interests lie in research, industry, or medicine, the exciting new field of genomics will play a crucial role in your future work. Genome data and high-throughput technologies for rapidly collecting genome-wide experimental data are already revolutionizing how scientists make discoveries and translate them into new medicines, and how doctors diagnose and treat patients in the clinic. With these extraordinary new opportunities come new challenges: understanding how to use genomics experiments and data requires a new set of skills and knowledge, quite different from classical biochemistry or molecular biology thinking.

In this class you will learn the key foundations of this field:

• genomics technologies and experiments: you will understand the power of new technologies such as microarrays, next-generation sequencing, functional genomics, proteomics, epigenomics and others.
• computational analysis of genomics data (bioinformatics): you will gain hands-on skill in using powerful bioinformatics tools to analyze massive genomics datasets.
• how to think genomically: how to design new kinds of genomics experiments and how to overcome the key challenges for interpreting the meaning of the data.

Note:

• this class assumes no previous background in programming.
• this class requires no programming. Instead, the class projects focus on learning how to answer biological questions from genomics data, using existing computational tools.
• this is an introductory course. It can serve as a gateway to many additional genomics and bioinformatics courses at UCLA, or graduate training in this rapidly growing field.

## Intro Bioinfo Theory Course Example Release 1

February 9th, 2012

#### Introduction

What is the purpose of this release?

It illustrates the kind of content I am releasing as open source course materials, to bootstrap an open source bioinformatics teaching materials consortium where instructors can selectively use, modify and share materials for their own teaching. If you’re interested in using these materials or participating in such a consortium, I invite you to contribute your thoughts or feedback in the Comments section below, or by email to leec@chem.ucla.edu.

What is this?

This is a snapshot of the reading, lectures, homework, projects, practice exams, and exams from my 2011 Bioinformatics Theory course offered separately as a CS undergrad course and Bioinformatics graduate course (different exams; separate graduate term project). The course uses a core set of simple genetics, sequence analysis and phylogeny problems to teach fundamental principles that arise in virtually all bioinformatics problems. This course is not for students who want to learn to use existing methods (e.g. BLAST) but rather for students who might in the future want to invent new bioinformatics analyses. It emphasizes statistical inference, graph models and computational complexity.

Note: this is not a standard lecture course; approximately half the class time was devoted to in-class concept tests, where the class was presented with a question that tests conceptual understanding. Students answered concept tests using an open-response (i.e. not multiple choice) in-class response system by typing answers on their laptops or smartphones. We then discussed our answers using Peer Instruction methods, and I analyzed all the individual students’ answers in detail; at the subsequent class, I went through each of the conceptual errors the students made for each question. I have written approximately 200 concept tests for a wide range of statistical and computational concepts relevant to bioinformatics, and a wide variety of "problems-to-work" (i.e. more conventional homework problems) covering the same material. I am making all of these materials and software available as open source; this is the first step in that release process.

## Info-Evo Journal Club meeting Wed. Oct. 19 3:30 pm

October 17th, 2011

We’ll be having our first Info-Evo journal club meeting on Wednesday, Oct. 19, at 3:30 pm, in the third floor conference room of Boyer Hall. This will be an organizational meeting, to discuss what topics we want to cover etc. Please come if you are interested.

## Open Peer Review by a Selected-Papers Network

July 2nd, 2011

Update 2: we are working on implementing this, initially as an open peer review site for arXiv.  See the forum discussion here, and the code repository here.  You are invited to join this effort!

Update: this paper is now published, as a substantially shorter version (that’s a good thing!).  I suggest you read the published version; if you have comments or feedback you are welcome to post comments here.

Abstract

A Selected-Papers (SP) Network is a network in which researchers who read, write and review articles subscribe to each other based on common interests. Instead of reviewing a manuscript in secret for the Editor of a journal, each reviewer simply publishes his review (typically of a paper he wishes to recommend) on the SP network, which automatically forwards it to his subscribers. I present a three-phase plan for building a basic SP network, discovering and measuring the detailed structure of research communities, and transforming the SP network itself into an effective publisher of research articles in areas that are not well-supported by existing journals. I show how the SP network provides a new way of measuring impact, catalyzes the emergence of new subfields, and accelerates discovery in existing fields, by providing each reader a fine-grained filter for high-impact work.
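The subscribe-and-forward mechanism described in the abstract can be sketched in a few lines. Everything below (the class name, method names, and the tuple format for a review) is invented for illustration; it is not the actual SP network implementation:

```python
# Toy sketch of the core SP-network idea: reviewers publish reviews
# openly, and each published review is automatically forwarded to
# everyone subscribed to that reviewer. All names are illustrative.
from collections import defaultdict

class SPNetwork:
    def __init__(self):
        self.subscribers = defaultdict(set)  # reviewer -> set of subscribers
        self.inbox = defaultdict(list)       # reader -> reviews received

    def subscribe(self, reader, reviewer):
        """A reader subscribes to a reviewer based on common interests."""
        self.subscribers[reviewer].add(reader)

    def publish_review(self, reviewer, paper_id, review_text):
        """Publish a review openly and forward it to all subscribers."""
        review = (reviewer, paper_id, review_text)
        for reader in self.subscribers[reviewer]:
            self.inbox[reader].append(review)
        return review

net = SPNetwork()
net.subscribe("alice", "bob")
net.publish_review("bob", "arXiv:1234.5678", "Recommended: clear methods.")
print(net.inbox["alice"])
```

The key contrast with journal peer review is visible even in this sketch: the review is a first-class published object routed by subscription, not a secret message to an editor.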

## New “Open Science” Resources Page

May 3rd, 2011

I’m working on a variety of projects in service of the open sourcing of science, by which I simply mean enabling everyone in the scientific community to use their own tools and channels (under their control) for publishing; sharing data; teaching and learning etc. I’ve added a page that collects together my various efforts.

## General information metrics for automated experiment planning talk video available

May 3rd, 2011

I just posted my talk on empirical information metrics for experiment planning. This talk describes empirical information metrics for measuring the “information value” of an experiment either before you actually do it (experiment planning), or after (measuring how much information is contained in the empirical data, for improving our total prediction power). In addition to outlining the basic theory, the talk presents some example applications (e.g. the information value of including a control experiment), and a real-world application to the computational design of a new genomics experiment, “phenotype sequencing”, for identifying the genetic causes of a phenotype directly from sequencing of multiple independent mutants. Presented in the UCLA Chemistry & Biochemistry Department faculty luncheon series, May 2, 2011.

## Phenotype Sequencing Software and Documentation Released

February 20th, 2011

We’ve released our phenoseq software package for phenotype sequencing experimental data analysis and simulation. It can score genes as likely causes of a phenotype, based on sequencing data. It can also perform simulations of possible phenotype sequencing experiment designs, to assess the prospects for success. Phenoseq is implemented in Python, and uses NumPy/SciPy for fast calculations. For example, it can simulate about 600,000 mutant genomes per second on my early-2008 MacBook Pro.
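To illustrate the kind of scoring-by-simulation this involves, here is a toy sketch using only the standard library. It is not phenoseq's actual API, and the gene count, mutation rates, and enrichment level are invented: we simulate independent mutant strains carrying random background mutations plus enrichment in one causal gene, then rank each gene by the Poisson tail probability of its hit count under the null (purely random) model.

```python
# Toy phenotype-sequencing simulation: a gene mutated in far more
# independent strains than chance predicts is a causal candidate.
import math
import random

random.seed(0)
n_genes, n_strains = 1000, 30
background_rate = 60 / n_genes   # ~60 random background mutations per strain
causal_gene = 0                  # pretend gene 0 causes the phenotype

# counts[g] = number of strains carrying >= 1 mutation in gene g
counts = [0] * n_genes
for _ in range(n_strains):
    for g in range(n_genes):
        hit_prob = 0.9 if g == causal_gene else background_rate
        if random.random() < hit_prob:
            counts[g] += 1

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam); small values flag enrichment."""
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i)
                     for i in range(k))

lam = n_strains * background_rate   # expected hit count under the null
ranked = sorted(range(n_genes), key=lambda g: poisson_tail(counts[g], lam))
print("top candidate gene:", ranked[0], "hit in", counts[ranked[0]], "strains")
```

The causal gene is hit in most strains while background genes cluster near the null expectation of about two hits, so the ranking recovers it easily; shrinking the strain count or the enrichment is a crude way to probe when an experiment design stops working.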

The code is available at Github, and the docs are available here.

## Phenotype Sequencing Video Online

February 18th, 2011

If you’d rather hear about our new phenotype sequencing method than read about it, I’ve posted a video of my phenotype sequencing talk here. Enjoy!

## Phenotype Sequencing Paper Published

February 18th, 2011

Our phenotype sequencing paper has been published by PLoS ONE. This paper presents a new general approach that can identify the genetic causes of a phenotype directly from sequencing of independent mutants, at minimal cost. Follow the link for the full text of the paper; here’s the abstract.

Random mutagenesis and phenotype screening provide a powerful method for dissecting microbial functions, but their results can be laborious to analyze experimentally. Each mutant strain may contain 50–100 random mutations, necessitating extensive functional experiments to determine which one causes the selected phenotype. To solve this problem, we propose a "Phenotype Sequencing" approach in which genes causing the phenotype can be identified directly from sequencing of multiple independent mutants. We developed a new computational analysis method showing that 1. causal genes can be identified with high probability from even a modest number of mutant genomes; 2. costs can be cut many-fold compared with a conventional genome sequencing approach via an optimized strategy of library-pooling (multiple strains per library) and tag-pooling (multiple tagged libraries per sequencing lane). We have performed extensive validation experiments on a set of E. coli mutants with increased isobutanol biofuel tolerance. We generated a range of sequencing experiments varying from 3 to 32 mutant strains, with pooling on 1 to 3 sequencing lanes. Our statistical analysis of these data (4099 mutations from 32 mutant genomes) successfully identified 3 genes (acrB, marC, acrA) that have been independently validated as causing this experimental phenotype. It must be emphasized that our approach reduces mutant sequencing costs enormously. Whereas a conventional genome sequencing experiment would have cost $7,200 in reagents alone, our Phenotype Sequencing design yielded the same information value for only $1,200. In fact, our smallest experiments reliably identified acrB and marC at a cost of only $110–$340.

## Empirical information metrics paper published

February 2nd, 2011

It is common to measure the information value of a model as its average prediction power for some observable variable of interest. Then the absolute goodness-of-fit of a statistical model to a set of observations can be formulated as the total remaining information obtainable by the set of all models that we have not yet computed (one of which might fit the observations much better than our current model). In a paper just published in Information I define this metric as the potential information, and show that it can be estimated directly from the observations, without actually computing any of the remaining models.

This addresses a simple question in Bayesian inference: how do we know when we’re done? Bayesian inference is widely used in many disciplines, because it provides a general framework for evaluating the strength of evidence for a list of competing theories \Psi_i, given a set of experimental observations obs. If all problems could be solved by computing a short list of possible models (theories), this would be a good general strategy. In real-world scientific inference, however, we cannot a priori assume that the possibilities can be limited to a fixed list of models. So in practice we face a set of all possible models that is effectively infinite (or at least unmanageably large), of which we only calculate a small subset of terms. This raises the unsettling possibility that the correct model \Omega may not even be included in the subset of terms that we calculated.

Specifically, the probability of a model \Psi_i given a set of observations obs is calculated via Bayes’ Law:

p(\Psi_i|obs) = \frac{p(obs|\Psi_i)p(\Psi_i)}{p(obs)}

where the denominator p(obs) is calculated via the expansion p(obs)=\sum_i{p(obs|\Psi_i)p(\Psi_i)}. If the set of all possible models \Psi_i is infinite, we will only be able to calculate this sum for a subset of terms \Psi_1 … \Psi_n. This underestimates the total sum p(obs) and therefore overestimates the probability of any model p(\Psi_i|obs), perhaps grossly. The real question is whether the correct model \Omega was included in the calculated terms \Psi_1 … \Psi_n or not. Since by definition \Omega maximizes p(obs|\Psi), if included it may dominate the sum, and therefore our calculated probabilities may be reasonably accurate (in which case we are “done”). But if not, then they may be very inaccurate, and we would need to calculate more terms of the model series in the hopes of finding \Omega. So how do we know whether we’re done?
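A small numerical toy makes the overestimation concrete (the likelihood values below are invented for illustration, not taken from the paper): dropping \Omega from the computed terms shrinks the truncated p(obs), which inflates every computed posterior.

```python
# Toy discrete Bayes: five candidate models with known likelihoods.
# Model 0 plays the role of Omega, the best-fitting (correct) model.
likelihoods = [0.50, 0.10, 0.05, 0.02, 0.01]  # p(obs|Psi_i)
prior = 1.0 / len(likelihoods)                # uniform prior p(Psi_i)

def posterior(i, included):
    """p(Psi_i|obs) computed using only the models listed in `included`."""
    p_obs = sum(likelihoods[j] * prior for j in included)  # truncated sum
    return likelihoods[i] * prior / p_obs

full = posterior(1, included=range(5))      # Omega among the computed terms
trunc = posterior(1, included=range(1, 5))  # Omega missing from the sum
print(round(full, 3), round(trunc, 3))      # the truncated posterior is larger
```

Here model 1's posterior jumps from 0.10/0.68 to 0.10/0.18, nearly a four-fold inflation, and nothing inside Bayes' Law itself signals that the dominant term was left out.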

Unfortunately, Bayes’ Law does not answer this question. Intuitively, if the calculated subset of models is a poor fit to the observations, this will be reflected in a very low value for the calculated probability of the observations p(obs) — much smaller than “it should be”. So how large should p(obs) be? Again, Bayes’ Law does not answer this.

This question is directly relevant to understanding the scientific method mathematically, because it is related to Popper’s criterion of falsifiability, namely that a scientific theory is only useful if it makes predictions that could be shown to be wrong by experiments. Translated into Bayesian terms, this means showing that the fit of the calculated model terms to the experimental observations is not “good enough” — precisely the capability that Bayes’ Law lacks.

The potential information metric solves this problem by measuring the maximum amount of new information obtainable by computing all the remaining terms of the infinite model set. Its most interesting property is that we can measure it without actually computing any more terms of the infinite model set. For details on the metric and its relations with traditional information theory, see the paper.