MIA Playlist: Check out and share our growing library of MIA Primer Seminar videos.



MODELS, INFERENCE & ALGORITHMS

Models, Inference & Algorithms (MIA) is a Broad initiative to support learning and collaboration across the interface of biology and mathematics / statistics / machine learning / computer science. Our core activity is the Wednesday morning meeting in the Monadnock room (415 Main St, 2nd floor), featuring a method primer at 9, a main seminar with breakfast at 10, and a discussion with the speaker at 11. These meetings grew out of the Stat Math Reading Club (SMRC), a series of informal and pedagogical board talks; over a year and a half the talks attracted an ever-larger audience from the Broad and the wider Boston community. With MIA we strive to maintain SMRC's essential character, emphasizing lucid exposition of broadly applicable ideas over rapid-fire communication of research results, and encouraging questions from the audience throughout. In addition to the weekly meeting there are a number of other MIA activities; please contact mia-[email protected] to be added to our mailing list and learn more. The MIA Initiative is led by Jon Bloom and Alex Bloemendal.

March 1, 2017
Rafa Gómez-Bombarelli (http://aspuru.chem.harvard.edu/rafa-gomez-bombarelli/), Harvard Chemistry and Chemical Biology
Deep learning chemical space: a variational autoencoder for automatic molecular design
Abstract: Virtual screening has increasingly proven itself as a tool to test new molecules for a given application. Through simulation and regression we can gauge whether a molecule will be a promising candidate in an automatic and robust way. A large remaining challenge, however, is how to perform optimizations over a discrete space of size at least 10^60. Despite the size of chemical space, or perhaps precisely because of it, coming up with novel, stable, makeable molecules that are effective is not trivial. First-principles approaches to generating new molecules fail to capture the intuition embedded in the ~100 million existing molecules. I will report our progress towards developing an autoencoder that allows us to project molecular space into a continuous, differentiable representation where we can perform molecular optimization.

March 8, 2017
Carl de Boer, Regev Lab, Broad Institute
Learning the rules of gene regulation with millions of synthetic promoters
Abstract: Gene regulatory programs are encoded in the sequence of the DNA. However, how the cell uses transcription factors (TFs) to interpret regulatory sequence remains incompletely known. Synthetic regulatory sequences can provide insight into this logic by providing additional examples of sequences and their regulatory output in a controlled setting. Here, we have measured the gene expression output of tens of millions of unique promoter sequences, whose expressions span a range of 1000-fold, in a controlled reporter construct. This vast ...

March 29, 2017
Julien de Wit (http://people.csail.mit.edu/stefje/), MIT Earth and Planetary Sciences, and Nicolas Wieder, Greka Lab
A pseudo-random walk from new worlds to diabetes
Abstract: For centuries, our understanding of planetary systems has been based on observations of a unique sample, the Solar System. Similarly, our perspective on Life and habitats has remained Earth-centric, leaving millennia-old questions such as "Are we alone? Where/How/When did Life emerge?" unanswered. Two decades ago, the first planet orbiting a star other than ours—a.k.a. an exoplanet—was discovered, opening a new chapter of space exploration. Since then, over 3,500 exoplanets have been found in over 2,500 other systems; a sample size increase of three orders of magnitude that has already yielded profound changes in our understanding of planetary systems. Similar changes await our perspective on Life and habitats within the next generation. During this talk, a "Searching for New Worlds 101" will be provided to introduce the TRAPPIST-1 system, exploring our recent discovery of Earth-sized planets that are both potentially habitable and amenable to in-depth studies with upcoming observatories, and the first insights into their atmospheres, as revealed by the Hubble Space Telescope. At the other end of the scale, biology focuses on chemical processes within cells rather than within atmospheres. A fundamental—and yet mostly overlooked—set of cellular processes gravitates around transient calcium signals.
The availability of fast fluorescent calcium indicators allows for the measurement of intracellular calcium and thus provides direct observables of pathological and physiological calcium fluctuations. Calcium signals thereby offer new perspectives to approach a variety of diseases, from diabetes and metabolic disease to Alzheimer's disease. Interestingly, these seemingly diverse fields of biology and planetary sciences share a common cornerstone: (spectro)photometric time series. With the arrival of high-throughput facilities (e.g. TESS for exoplanetary sciences; FLIPR for biology), the need for standardized data acquisition/processing tools has emerged. The inherent similarity between these fields, in terms of multidisciplinarity and data type, allows for mutually beneficial collaborations that need to be leveraged to support the optimal sampling of yet unexplored parameter spaces, and their unbiased interpretation.

April 5, 2017
Jesse Engreitz, Lander Lab, Broad Institute
Grand Challenge: Mapping the regulatory wiring of the genome
Abstract: Our cells are controlled by complex molecular instructions encoded in the "noncoding" sequences of our genome, and alterations to these noncoding sequences underlie many common human diseases. The grammar of these noncoding sequences has been difficult to study, but the recent confluence of methods for both high-throughput measurement and high-throughput perturbation offers new opportunities to understand these sequences at a systems level. In this talk, I will highlight outstanding challenges in gene regulation where applying computational approaches in combination with emerging genomics datasets may allow us to build integrated maps that describe the regulatory wiring of the genome. As an example, I will present our efforts to experimentally and computationally map the functional connections between promoters and distal enhancers and use this information to understand human genetic variation in the noncoding genome.

April 12, 2017
Matt Johnson (http://people.csail.mit.edu/mattjj/), Google Brain
Composing graphical models with neural networks for structured representations and fast inference
Abstract: I'll describe a new modeling and inference framework that combines the flexibility of deep learning with the structured representations of probabilistic graphical models. The model family augments latent graphical model structure, like switching linear dynamical systems, with neural network observation likelihoods. To enable fast inference, we show how to leverage graph-structured approximating distributions and, building on variational autoencoders, fit recognition networks that learn to approximate difficult graph potentials with conjugate ones. I'll show how these methods can be applied to learn how to parse mouse behavior from depth video.

Scott Linderman (http://www.columbia.edu/~swl2133/), Columbia, Blei Lab
Primer: Bayesian time series modeling with recurrent switching linear dynamical systems
Abstract: Many natural systems, like neurons firing in the brain or basketball teams traversing a court, give rise to time series data with complex, nonlinear dynamics. We gain insight into these systems by decomposing the data into segments that are each explained by simpler dynamical units. Bayesian time series models provide a flexible framework for accomplishing this task. This primer will start with the basics, introducing linear dynamical systems and their switching variants.
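For readers meeting these models for the first time, a minimal sketch of an ordinary two-state switching linear dynamical system in Python/NumPy may help make the objects concrete (an illustration with invented parameters, not the speaker's recurrent model or code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two latent regimes, each with its own 2-D linear dynamics.
A = [np.array([[0.99, -0.10], [0.10, 0.99]]),     # regime 0: slow rotation
     np.array([[0.90,  0.00], [0.00, 0.90]])]     # regime 1: decay toward the origin
C = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # maps 2-D latents to 3-D observations
P = np.array([[0.95, 0.05], [0.05, 0.95]])        # Markov switching probabilities

T, z, x = 200, 0, np.array([1.0, 0.0])
regimes, observations = [], []
for t in range(T):
    z = rng.choice(2, p=P[z])                      # discrete regime switch
    x = A[z] @ x + 0.01 * rng.standard_normal(2)   # continuous linear dynamics
    y = C @ x + 0.05 * rng.standard_normal(3)      # noisy observation
    regimes.append(z)
    observations.append(y)

observations = np.array(observations)  # what an inference algorithm actually sees
```

In the recurrent variant introduced in the talk, the switching probabilities additionally depend on the continuous state, which is what lets the model learn where in state space transitions happen.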
With this background in place, I will introduce a new model class called recurrent switching linear dynamical systems (http://arxiv.org/abs/1610.08466) (rSLDS), which discover distinct dynamical units as well as the input- and state-dependent manner in which units transition from one to another. In practice, this leads to models that generate much more realistic data than standard SLDS. Our key innovation is to design these recurrent SLDS models to enable recent Pólya-gamma auxiliary variable techniques and thus make approximate Bayesian learning and inference in these models easy, fast, and scalable.

April 19, 2017
Jerome Kelleher (http://jeromekelleher.net/pages/about.html), Wellcome Trust Centre for Human Genetics, Oxford
Simulating, storing and processing genetic variation data for millions of samples
Abstract: Coalescent theory has played a key role in modern population genetics and is fundamental to our understanding of genetic variation. While simulation has been essential to coalescent theory from its beginnings, simulating realistic population-scale genome-wide data sets under the exact model was, until recently, considered infeasible. Even under an approximate model, simulating more than a few tens of thousands of samples was very time consuming and could take several weeks to complete a single replicate. However, by encoding simulated genealogies using a new data structure (called a tree sequence), we can now simulate entire chromosomes for millions of samples under the exact coalescent model in a few hours. We discuss some applications that these simulations have made possible, including a study of biases in human GWAS and the systematic benchmarking of variant processing tools at scale. The tree sequence data structure is also an extremely concise way of representing genetic variation data, and we show how variant data for millions of simulated human samples can be stored in only a few gigabytes. Moreover, we show that this very high level of compression does not incur a decompression cost. Because the information is represented in terms of the underlying genealogies, operations such as computing allele frequencies on sample subsets or measuring linkage disequilibrium can be made very efficient. Finally, we discuss ongoing work on inferring tree sequences from observed data and present some preliminary results.

April 26, 2017
Peter Kharchenko (http://pklab.med.harvard.edu/index.html), Department of Biomedical Informatics, Harvard Medical School
From one to millions of cells: computational challenges in single-cell analysis
Abstract: Over the last five years, our ability to isolate and analyze detailed molecular features of individual cells has expanded greatly. In particular, the number of cells measured by single-cell RNA-seq (scRNA-seq) experiments has gone from dozens to over a million cells, thanks to improved protocols and fluidic handling. Analysis of such data can provide detailed information on the composition of heterogeneous biological samples and the variety of cellular processes that altogether comprise the cellular state. Such inferences, however, require careful statistical treatment to take into account measurement noise as well as inherent biological stochasticity. I will discuss several approaches we have developed to address such problems, including error modeling techniques, statistical interrogation of heterogeneity using gene sets, and visualization of complex heterogeneity patterns, implemented in the PAGODA package.
I will discuss how these approaches have been modified to enable fast analysis of very large datasets in PAGODA2, and how the flow of typical scRNA-seq analysis can be adapted to take advantage of potentially extensive repositories of scRNA-seq measurements. Finally, I will illustrate how such approaches can be used to study transcriptional and epigenetic heterogeneity in human brains.

Jean Fan (http://scholar.harvard.edu/jeanfan/home), Harvard Medical School, Kharchenko Lab
Primer: Linking genetic and transcriptional intratumoral heterogeneity at the single cell level

May 10, 2017
Ahmed Badran, Broad Fellow, Chemical Biology & Therapeutic Sciences
Continuous directed evolution: advances, applications, and opportunities
Abstract: The development and application of methods for the laboratory evolution of biomolecules has rapidly progressed over the last few decades. Advancements in continuous microbe culturing and selection design have facilitated the development of new technologies that enable the continuous directed evolution of proteins and nucleic acids. These technologies have the potential to support the extremely rapid evolution of biomolecules with tailor-made functional properties. Continuous evolution methods must support all of the key steps of laboratory evolution — translation of genes into gene products, selection or screening, replication of genes encoding the most fit gene products, and mutation of surviving genes — in a self-sustaining manner that requires little or no researcher intervention. In this presentation, I will describe the basis and applications of our Phage-Assisted Continuous Evolution (PACE) platform, solutions we have devised to address known limitations in the technique, and opportunities to improve PACE where in silico computation may play a key role. Through these tools, we aspire to enable researchers to address increasingly complex biological questions and to access biomolecules with novel or even unprecedented properties.

May 17, 2017
Tamara Broderick (http://www.tamarabroderick.com/), MIT EECS, CSAIL, and IDSS
Edge-exchangeable graphs, clustering, and sparsity
Abstract: Many popular network models rely on the assumption of (vertex) exchangeability, in which the distribution of the graph is invariant to relabelings of the vertices. However, the Aldous-Hoover theorem guarantees that these graphs are dense or empty with probability one, whereas many real-world graphs are sparse. We present an alternative notion of exchangeability for random graphs, which we call edge exchangeability, in which the distribution of a graph sequence is invariant to the order of the edges. We demonstrate that a wide range of edge-exchangeable models, unlike any models that are traditionally vertex-exchangeable, can exhibit sparsity. To develop characterization theorems for edge-exchangeable graphs analogous to the powerful Aldous-Hoover theorem for vertex-exchangeable graphs, we turn to a seemingly different combinatorial problem: clustering. Clustering involves placing entities into mutually exclusive categories. A "feature allocation" relaxes the requirement of mutual exclusivity and allows entities to belong simultaneously to multiple categories. In the case of clustering, the class of probability distributions over exchangeable partitions of a dataset has been characterized (via "exchangeable partition probability functions" and the "Kingman paintbox").
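For readers who have not met exchangeable random partitions before, a minimal Chinese restaurant process sampler (the standard textbook construction, not code from the talk) makes the "number of clusters grows with the data" behavior concrete:

```python
import numpy as np

def crp_partition(n, alpha, seed=0):
    """Sample a partition of n items from a Chinese restaurant process with concentration alpha."""
    rng = np.random.default_rng(seed)
    assignments, counts = [0], [1]          # the first customer opens the first table
    for i in range(1, n):
        # Join existing table k with probability counts[k] / (i + alpha),
        # or open a new table with probability alpha / (i + alpha).
        probs = np.array(counts + [alpha]) / (i + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)                # a new cluster appears
        else:
            counts[table] += 1
        assignments.append(int(table))
    return assignments

print(crp_partition(20, alpha=1.0))         # cluster label for each of 20 items
```

The distribution over partitions induced by this process is exactly the exchangeable partition probability function of the Dirichlet process, which is why the two objects appear together in the primer that follows.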
These characterizations support an elegant nonparametric Bayesian framework for clustering in which the number of clusters is not assumed to be known a priori. We show how these characterizations can be extended to feature allocations and, from there, to edge-exchangeable graphs.

Tamara Broderick
Primer: Nonparametric Bayesian models, methods, and applications
Abstract: Nonparametric Bayesian methods make use of infinite-dimensional mathematical structures to allow the practitioner to learn more from their data as the size of their data set grows. What does that mean, and how does it work in practice? In this tutorial, we'll cover why machine learning and statistics need more than just parametric Bayesian inference. We'll introduce such foundational nonparametric Bayesian models as the Dirichlet process and Chinese restaurant process and touch on the wide variety of models available in nonparametric Bayes. Along the way, we'll see what exactly nonparametric Bayesian methods are and what they accomplish.

May 24, 2017
Fabian Theis (http://fabian.theis.name/), Helmholtz Zentrum München, TU Munich
Reconstructing trajectories and branching lineages in single cell genomics
Abstract: Single-cell technologies have gained popularity in developmental biology because they allow resolving potential heterogeneities due to asynchronicity of differentiating cells. Common data analysis encompasses normalization, followed by dimension reduction and clustering to identify subgroups. However, in the case of cellular differentiation, we may not expect clear clusters to be present; instead, cells tend to follow continuous branching lineages. In this talk I will first review methods for pseudotime ordering of cells according to their single cell profiles, which are used for reconstructing such trajectories. Then I will show that modeling the high-dimensional state space as a diffusion process, where cells move to close-by cells with a distance-dependent probability, reflects the differentiating characteristics well. Based on the underlying diffusion map transition kernel, cells can be ordered according to a diffusion pseudotime (DPT), which allows for a robust identification of branching decisions and corresponding trajectories of single cells. After application to blood stem cell differentiation, I finish with current extensions towards single-cell RNA-seq time series and population models as well as driver-gene identification.

June 8, 2017
Ed Finn (http://csi.asu.edu/people/ed-finn/), Center for Science and the Imagination, Arizona State University
What Algorithms Want
Abstract: We depend on — we believe in — algorithms to help us get a ride, choose which book to buy, execute a mathematical proof. It is as if we think of code as a magic spell, an incantation to reveal what we need to know and even what we want. But how do we navigate the gap between what algorithms really do and all the things we think, and hope, they do? This talk explores the evolving figure of the algorithm as it bridges the idealized space of computation and messy reality, with unpredictable and sometimes fascinating results. Drawing on sources that range from Neal Stephenson's "Snow Crash" to Diderot's "Encyclopédie," from Adam Smith to the "Star Trek" computer, Finn explores the gap between theoretical ideas and pragmatic instructions, and the consequences of that gap for research at the intersection of computation and culture.

Fall 2016 Schedule: 8:30am Primer, 9:20am Breakfast, 9:30am Seminar, 10:30am Discussion; all in Monadnock

DATE | SPEAKER | AFFILIATION | TITLE
Sep 14 | Brian Cleary [preprint (http://biorxiv.org/content/early/2017/01/02/091926), video] | Regev and Lander Labs, Broad Institute and MIT CSBi | Composite measurements and molecular compressed sensing for efficient transcriptomics at scale
Sep 21 | Alp Kucukelbir (http://www.proditus.com/) [video, slides] | Columbia CS | Automated Inference and the Promise of Probabilistic Programming [edward (http://edwardlib.org/)]
Sep 28 | Rafael Irizarry [video] | Dana-Farber Cancer Institute, Harvard School of Public Health | Overcoming Bias and Batch Effects in High-Throughput Data [paper (http://biorxiv.org/content/early/2015/08/28/025767)]
Oct 12 | Chad Giusti (http://www.chadgiusti.com/) [video, slides] | Warren Center for Network and Data Sciences, Math Dept, U Penn | Topological data analysis: What can persistent homology see?
Oct 26 | Umut Eser [video, slides] | Churchman Lab, HMS | FIDDLE: An integrative deep learning framework for functional genomic data inference [paper (http://biorxiv.org/content/early/2016/10/17/081380)]
Nov 2 | Anshul Kundaje [video] | Genetics Dept, CS Dept, Stanford University | Integrative, interpretable deep learning frameworks for regulatory genomics and epigenomics
Nov 9 | Quaid Morris (http://www.morrislab.ca/) [video] | University of Toronto | Algorithms for reconstructing tumor evolution [paper (http://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0602-8)]
Nov 16 | Dayong Wang (http://dayongwang.info/), Andy Beck (http://becklab.hms.harvard.edu/) [video, slides] | HMS | Deep learning for computational pathology
Nov 30 | | Microsoft Research New England | Computational Aspects of Biological Information 2016
Dec 7 | Ryan Peckner [video, slides] | Proteomics Platform | Spectral unmixing for next-generation mass spectrometry proteomics
Dec 14 | Daniel Huang [video, slides] | Harvard SEAS | Compiling probabilistic programs

September 14, 2016
Brian Cleary, Regev and Lander Labs, Broad Institute and MIT CSBi
Composite measurements and molecular compressed sensing for efficient transcriptomics at scale
Abstract: Comprehensive RNA profiling provides an excellent phenotype of cellular responses and tissue states, but can be prohibitively expensive to generate at the massive scale required for studies of regulatory circuits, genetic states or perturbation screens. However, because expression profiles may reflect a limited number of degrees of freedom, a smaller number of measurements might suffice to capture most of the information. Here, we use existing mathematical guarantees to demonstrate that gene expression information can be preserved in a random low-dimensional space. We propose that samples can be directly observed in low dimension through a fundamentally new type of measurement that distributes a single readout across many genes. We show by simulation that as few as 100 of these randomly composed measurements are needed to accurately estimate the global similarity between any pair of samples. Furthermore, we show that methods of compressive sensing can be used to recover gene abundances from drastically under-sampled measurements, even in the absence of any prior knowledge of gene-to-gene correlations. Finally, we propose an experimental scheme for such composite measurements. Thus, compressive sensing and composite measurements can become the basis of a massive scale-up in the number of samples that can be profiled, opening new opportunities in the study of single cells, complex tissues, perturbation screens and expression-based diagnostics.

September 21, 2016
Alp Kucukelbir (http://www.proditus.com/), Columbia CS
Automated Inference and the Promise of Probabilistic Programming
Abstract: Generative probability models allow us to 1) express assumptions about hidden patterns in data, 2) infer such hidden patterns, and 3) evaluate the accuracy of our findings. However, designing modern models, developing custom inference algorithms, and evaluating accuracy requires enormous effort and cross-disciplinary expertise. Probabilistic programming promises to ease this process by making each step less arduous and more automated. I will begin by describing how probabilistic programming can help design modern probability models. I will then focus on automating inference for a wide class of probability models. To this end, I will describe automatic differentiation variational inference, a fully automated approximate inference algorithm. I will demonstrate its application to a mixture modeling analysis of a dataset with millions of observations. I intend to conclude with some thoughts on model evaluation, with a population genetics example. Throughout this talk, I will highlight connections to our software project, Edward (http://edwardlib.org/): a Python library for probabilistic modeling, inference, and evaluation.

Rajesh Ranganath, Princeton CS
Primer: Probabilistic Generative Models and Posterior Inference
Abstract: To model data we desire to express assumptions about the data, infer hidden structure, make predictions, and simulate new data. In this talk, I will describe how probabilistic generative models provide a common toolkit to meet these challenges. I will first present these ideas in a toy setting, followed by discussing the range of probabilistic generative models from structural to algorithmic. Next I will present an in-depth view of deep exponential families, a class of probability models containing both predictive and interpretive models. I will end with the central computational problem in realizing the promise of probabilistic generative models: posterior inference. I will demonstrate why deriving inference is tedious and will touch on black box variational methods, which seek to alleviate this burden.

September 28, 2016
Rafael Irizarry, Dana-Farber Cancer Institute, Harvard School of Public Health
Overcoming Bias and Batch Effects in High-Throughput Data
Abstract: The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. In the life sciences, data analysis is now part of practically every research project. Genomics, in particular, is being driven by new measurement technologies that permit us to observe certain molecular entities for the first time. These observations are leading to discoveries analogous to identifying microorganisms and other breakthroughs permitted by the invention of the microscope. One example is the many applications of next-generation sequencing. Biases, systematic errors and unexpected variability are common in biological data. Failure to discover these problems often leads to flawed analyses and false discoveries. As datasets become larger, the potential for these biases to appear significant actually increases. In this talk I will describe several of these challenges using very specific examples from gene expression microarrays, RNA-seq, and single-cell assays. I will describe data science solutions to these problems.

Adrian Veres (http://www.adrianveres.com/), Harvard Sys Bio, HST
Primer: Experimental and computational techniques underlying RNA-seq
Abstract: We will provide an overview of the experimental and computational steps involved in RNA-seq for both bulk and single-cell experiments. We will begin with a brief review of Illumina short-read sequencing by synthesis; continue to describing the molecular biology used in preparing RNA-seq libraries; and discuss quality trimming, read alignment, transcript quantification and normalization of gene expression measures. We will conclude with a discussion of techniques commonly leveraged in single-cell RNA-seq: linear pre-amplification, unique molecular identifiers (UMI/RMTs) and 3'-barcode counting. Throughout the primer, we will mention potential sources of bias that can be introduced at each step and why they occur.

October 12, 2016
Chad Giusti (http://www.chadgiusti.com/), Warren Center for Network and Data Sciences, Complex Systems Group & Department of Mathematics, University of Pennsylvania
What can persistent homology see?
Abstract: The usual framework for TDA takes as its starting point that a data set is sampled (noisily) from a manifold embedded in a high-dimensional space, and provides a reconstruction of topological features of that manifold. However, the underlying algebraic topology can be applied to data in a much broader sense, carries much richer information about the system than just the barcodes, and can be fine-tuned so it sees only features of the data we want it to see. I will discuss this framework broadly, with focus on a few of these alternative viewpoints, including applications to neuroscience and matrix factorization.

Ann Sizemore (http://www.aesizemore.com), Functional Cancer Genomics, Broad Institute of MIT and Harvard
Primer: What is persistent homology?
Abstract: A fundamental question in big data analysis is if or how data points may be sampled, noisily, from an intrinsically low-dimensional geometric shape, called a manifold, embedded in a high-dimensional "sensor" space. Topological data analysis (TDA) aims to measure the "intrinsic shape" of data and identify this manifold despite noise and the likely nonlinear embedding. I will discuss the basics of the fundamental tool in TDA called persistent homology, which assigns to a point cloud a count of topological features – roughly "holes" of various dimensions – with a measure of importance of each feature recorded in a "barcode" of the data to help distinguish the significant features from the noise.

October 26, 2016
Umut Eser, Churchman Lab, Harvard Medical School
FIDDLE: An integrative deep learning framework for functional genomic data inference
Abstract: Numerous advances in sequencing technologies have revolutionized genomics through generating many types of genomic functional data. Statistical tools have been developed to analyze individual data types, but strategies are lacking to integrate disparate datasets under a unified framework. Moreover, most analysis techniques heavily rely on feature selection and data preprocessing, which increase the difficulty of addressing biological questions through the integration of multiple datasets. Here, we introduce FIDDLE (Flexible Integration of Data with Deep LEarning), an open source, data-agnostic, flexible integrative framework that learns a unified representation from multiple data types to infer another data type. As a case study, we use multiple Saccharomyces cerevisiae genomic datasets to predict global transcription start sites (TSS) through the simulation of TSS-seq data. We demonstrate that a type of data can be inferred from other sources of data types without manually specifying the relevant features and preprocessing. We show that models built from multiple genome-wide datasets perform profoundly better than models built from individual datasets. Thus, FIDDLE learns the complex synergistic relationship within individual datasets and, importantly, across datasets.

Alex Wiltschko, Twitter Cortex
Primer: Automatic differentiation, the algorithm behind all deep neural networks
Abstract: A painful and error-prone step of working with gradient-based models (deep neural networks being one kind) is actually deriving the gradient updates. Deep learning frameworks, like Torch, TensorFlow and Theano, have made this a great deal easier for a limited set of models — these frameworks save the user from doing any significant calculus by instead forcing the framework developers to do all of it.
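A minimal sketch of what an automatic-differentiation library buys you (here using the open-source autograd package this primer is about; the toy model and data are invented for illustration):

```python
import numpy as onp                    # ordinary NumPy, used only to make fake data
import autograd.numpy as np            # thinly wrapped NumPy that records operations
from autograd import grad

def loss(weights, inputs, targets):
    """Mean squared error of a tiny one-layer model with a tanh nonlinearity."""
    preds = np.tanh(np.dot(inputs, weights))
    return np.mean((preds - targets) ** 2)

# grad() returns a new function that evaluates d(loss)/d(weights) by
# reverse-mode automatic differentiation; no gradient is derived by hand.
loss_grad = grad(loss)

rng = onp.random.RandomState(0)
w = rng.randn(3)
x, y = rng.randn(100, 3), rng.randn(100)
for _ in range(200):                   # plain gradient descent on the weights
    w = w - 0.1 * loss_grad(w, x, y)
```

Deep learning frameworks apply the same bookkeeping to much larger computation graphs; changing the model only requires rewriting the loss function.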
However, if a user wants to experiment with a new model type, or change some small detail the developers hadn't planned for, they are back to deriving gradients by hand. Fortunately, a 30+ year old idea, called "automatic differentiation", and a one-year-old machine learning-oriented implementation of it, called "autograd", can bring true and lasting peace to the hearts of model builders. With autograd, building and training even extremely exotic neural networks becomes as easy as describing the architecture. We will also address two practical questions — "What's the difference between all these deep learning libraries?" and "What does this all mean to me, as a biologist?" — as well as provide some detail and historical perspective on the topic of automatic differentiation.

November 2, 2016
Anshul Kundaje, Dept. of Genetics and Dept. of Computer Science, Stanford University
Integrative, interpretable deep learning frameworks for regulatory genomics and epigenomics
Abstract: We present generalizable and interpretable supervised deep learning frameworks to predict regulatory and epigenetic state of putative functional genomic elements by integrating raw DNA sequence with diverse chromatin assays such as ATAC-seq, DNase-seq or MNase-seq. First, we develop novel multi-channel, multi-modal CNNs that integrate DNA sequence and chromatin accessibility profiles (DNase-seq or ATAC-seq) to predict in-vivo binding sites of a diverse set of transcription factors (TFs) across cell types with high accuracy. Our integrative models provide significant improvements over other state-of-the-art methods, including recently published deep learning TF binding models. Next, we train multi-task, multi-modal deep CNNs to simultaneously predict multiple histone modifications and combinatorial chromatin state at regulatory elements by integrating DNA sequence, RNA-seq and ATAC-seq or a combination of DNase-seq and MNase-seq. Our models achieve high prediction accuracy even across cell types, revealing a fundamental predictive relationship between chromatin architecture and histone modifications. Finally, we develop DeepLIFT (Deep Linear Importance Feature Tracker), a novel interpretation engine for extracting predictive and biologically meaningful patterns from deep neural networks (DNNs) for diverse genomic data types. DeepLIFT is the first method that can integrate the combined effects of multiple cooperating filters and compute importance scores accounting for redundant patterns. We apply DeepLIFT to our models to obtain unified TF sequence affinity models, infer high-resolution point binding events of TFs, dissect regulatory sequence grammars involving homodimeric and heterodimeric binding with co-factors, learn predictive chromatin architectural features, and unravel the sequence and architectural heterogeneity of regulatory elements.

November 9, 2016
Quaid Morris (http://www.morrislab.ca/), University of Toronto
Algorithms for reconstructing tumor evolution
Abstract: Tumors contain genetically heterogeneous cancerous subpopulations that can differ in their metastatic potential and response to treatment. Our work over the past few years has focused on using computational and statistical methods to reconstruct the phylogeny and the full genotypes of these subpopulations using data from high-throughput sequencing of tumor samples. Tumor subpopulations can be partially characterised by identifying tumor-associated somatic variants using short-read sequencing.
Subsequent inference of copy number variants or clustering of the variant allele frequencies (VAFs) can reveal the number of major subpopulations present in the tumor as well as the set of mutations which first appear in each subpopulation. Further analysis, and often different data, is needed to determine how the subpopulations relate to one another and whether they share any mutations. Ideally, this analysis would reconstruct the full genotypes of each subpopulation. I will describe my lab's efforts to recover these full genotypes by reconstructing the tumor's evolutionary history. We do this by fitting subpopulation phylogenies to the VAFs. In some circumstances, a full reconstruction is possible, but often multiple phylogenies are consistent with the data. We have developed a number of methods (PhyloSub, PhyloWGS, treeCRP, PhyloSpan) that use Bayesian inference in non-parametric models to distinguish ambiguous and unambiguous portions of the phylogeny, thereby explicitly representing reconstruction uncertainty. Our methods consider both single nucleotide variants as well as copy number variations and adapt to data on pairs of mutations.

David Benjamin, Data Sciences & Data Engineering
Primer: Intro to Dirichlet Processes
Abstract: At a mundane level, Dirichlet processes are a clustering algorithm that determines the number of clusters. However, they are also a way to do Bayesian inference on a single infinite model rather than ad hoc model selection on a series of finite models, and they are the gateway to the field of Bayesian non-parametric models. Many introductions to Dirichlet processes take a formal measure-theoretic approach. In contrast, if you can understand the multinomial distribution you will understand this primer.

November 16, 2016
Dayong Wang (http://dayongwang.info/), Andy Beck (http://becklab.hms.harvard.edu/), Beck Lab, Harvard Medical School at Beth Israel Deaconess Medical Center
Deep learning for computational pathology
Abstract: In this talk, we will provide an introduction to computational pathology, an emerging cross-discipline between pathology and computer engineering. We will also introduce a deep learning-based automatic whole slide image analysis system for the identification of cancer metastases in breast sentinel lymph nodes. Our system won first place in the international challenge Camelyon16, which was held at the International Symposium on Biomedical Imaging (ISBI) 2016. The system achieved an area under the receiver operating curve (AUC) of 0.925 for the task of whole slide image classification and an average sensitivity of 0.705 for the tumor localization task. A pathologist independently reviewed the same images, obtaining a whole slide image classification AUC of 0.966 and a tumor localization score of 0.733. By combining the predictions from the human pathologist and the automatic analysis system, the performance becomes even higher. These results demonstrate the power of using deep learning to produce significant improvements in the accuracy of pathological diagnoses.

Babak Ehteshami Bejnordi (http://diagnijmegen.nl/index.php/Person?name=Babak_Ehteshami_Bejnordi), Beck Lab, Harvard Medical School at Beth Israel Deaconess Medical Center
Primer: Practical recommendations for training convolutional neural nets
Abstract: Deep learning, in particular the convolutional neural network (ConvNet), is rapidly emerging as one of the most successful approaches for image and speech recognition.
What distinguishes ConvNets and other deep learning systems from conventional machine learning techniques is their ability to learn the entire perception process from end to end. Deep learning systems use multiple nonlinear processing layers to learn useful representations of features directly from data. Searching the parameter space of deep architectures is a complex optimization task. ConvNets can be very sensitive to the settings of their hyper-parameters and network architecture. In this talk, I will give practical recommendations for training ConvNets and discuss the motivation and principles behind them. I will also provide recommendations on how to tackle various problems in analyzing medical image data, such as lack of data, highly skewed class distributions, etc. Finally, I will introduce some of the advanced ConvNet architectures used in medical image analysis and their suitability for various tasks such as detection, classification, and segmentation.

December 7, 2016
Ryan Peckner, Proteomics Platform
Spectral unmixing for next-generation mass spectrometry proteomics
Abstract: Mass spectrometry proteomics is the method of choice for large-scale quantitation of proteins in biological samples, allowing rapid measurement of the concentrations of thousands of proteins in various modified forms. However, this technique still faces fundamental challenges in terms of reproducibility, bias, and comprehensiveness of proteome coverage. Next-generation mass spectrometry, also known as data-independent acquisition, is a promising new approach with the potential to measure the proteome in a far more comprehensive and reproducible fashion than existing methods, but it has lacked a computational framework suited to the highly convoluted spectra it inherently produces. I will discuss Specter, an algorithm that employs linear unmixing to disambiguate the signals of individual proteins and peptides in next-generation mass spectra. In addition to describing the linear algebra underlying Specter, we'll discuss its implementation in Spark with Python, and see several real datasets to which it's been applied.

Karsten Krug, Proteomics Platform
Primer: Mass spectrometry-based proteomics
Abstract: Mass spectrometry is the workhorse technology to study the abundance and composition of proteins, the key players in every living cell. Within the last decade the technology experienced a revolution in terms of novel instrumentation and optimized sample handling protocols, resulting in ever-growing numbers of proteins and post-translational modifications that can be routinely studied on a system-wide scale. Briefly, proteins are extracted from cells or tissues and fragmented into smaller peptides. This extremely complex peptide mixture is subjected to liquid chromatography separation and subsequent tandem mass spectrometry analysis, in which mass-to-charge ratios of intact peptides and peptide fragments are recorded. Resulting mass spectra are matched to sequence databases or spectral libraries to read out the amino acid sequences and thereby identify the corresponding proteins. The technology is fundamentally different from sequencing-based genomics technology and faces different problems, such as the tremendous dynamic range of protein expression. The instruments can be operated in different acquisition modes for different applications.
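As a toy version of the linear-unmixing step described in the Specter abstract above (an illustration of the general idea with made-up spectra, not Specter's actual algorithm or data): model the observed mixed spectrum as a non-negative combination of library spectra and recover the coefficients by non-negative least squares.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Columns of S: reference spectra for 5 hypothetical peptides over 40 m/z bins.
S = rng.random((40, 5))

# A "measured" spectrum in which peptides 1 and 3 are present, plus noise.
true_coeffs = np.array([0.0, 2.0, 0.0, 0.7, 0.0])
observed = S @ true_coeffs + 0.01 * rng.standard_normal(40)

# Recover each peptide's contribution under a non-negativity constraint.
coeffs, residual = nnls(S, observed)
print(np.round(coeffs, 2))   # close to [0, 2, 0, 0.7, 0]
```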
I will briefly introduce the basics behind discovery or 'shotgun' proteomics, targeted proteomics, data-dependent acquisition and data-independent acquisition; the latter is a recent and promising development in the proteomics community but poses novel and only partly solved challenges in data analysis. Ryan Peckner will talk about Specter, an approach that tackles this problem using linear algebra.

December 14, 2016
Daniel Huang, Harvard SEAS
Compiling probabilistic programs
Abstract: Deriving and implementing an inference algorithm for a probabilistic model can be a difficult and error-prone task. Alternatively, in probabilistic programming, a compiler is used to transform a model into an inference algorithm. In this talk, we'll present probabilistic programming from the perspective of a compiler writer. A compiler for a traditional language uses intermediate languages (ILs) and static analysis to generate efficient code. We'll highlight how these ideas can be used in probabilistic programming for generating flexible and scalable inference algorithms.

Daniel King, Hail Team, Neale Lab
Primer: What is a compiler?
Abstract: A compiler is an algorithm that transforms a source language into a target language. The transformation typically includes an optimizing pass which reduces memory or time requirements. Classic compilers transform languages such as C or Java into near-machine code such as x86 Assembly or JVM Bytecode. Recent work on Domain Specific Languages (DSLs) expands the notion of "source language" in order to enable everyone to build easy-to-reason-about abstractions without the performance penalty. In this context, I will discuss compiler design and implementation techniques with examples.
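As a minimal, invented illustration of the "source language to target language" idea in this primer (a toy example, unrelated to Hail's actual compiler or to the probabilistic-programming compilers discussed in the seminar), here is a few-line compiler from Python arithmetic expressions to instructions for an imaginary stack machine:

```python
import ast

def compile_expr(source):
    """Compile an arithmetic expression into instructions for a toy stack machine."""
    ops = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL", ast.Div: "DIV"}

    def emit(node):
        if isinstance(node, ast.BinOp):
            # Post-order traversal: operands first, then the operator.
            return emit(node.left) + emit(node.right) + [ops[type(node.op)]]
        if isinstance(node, ast.Constant):
            return [("PUSH", node.value)]
        raise ValueError("unsupported syntax")

    return emit(ast.parse(source, mode="eval").body)

print(compile_expr("1 + 2 * 3"))
# [('PUSH', 1), ('PUSH', 2), ('PUSH', 3), 'MUL', 'ADD']
```

A probabilistic-programming compiler of the kind described in the seminar plays the same game at a larger scale, with a model description as the source language and an inference algorithm as the target.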

A biologist, a mathematician, and a computer scientist walk into a foobar (http://www.broadinstitute.org/blog/biologist-mathematician-and-computer-scientist-walk-foobar)

On Nov 12, the Broad welcomed a visit from Ryan Adams, a leader in machine learning, a field at the intersection of applied math and computer science that develops models and algorithms to learn from data...

Spring 2016 Schedule: 8:30am Primer, 9:30am Seminar, 10:30am Discussion in Monadnock

DATE | SPEAKER | AFFILIATION | TITLE
Jan 27 | Brendan Frey | Toronto Eng / Med / CS, CEO Deep Genomics | Genomic Medicine: Will Software Eat Bio?
Feb 3 | Shamil Sunyaev (http://genetics.bwh.harvard.edu/wiki/sunyaevlab/) | HMS, Brigham & Women's, Broad | Judging the importance of human mutations using evolutionary models
Feb 10 | Jeremy Gunawardena [slides] | Harvard Systems Biology | Systems biology: can mathematics lead experiments?
Feb 17 | Caroline Uhler (http://www.carolineuhler.com/) [video, slides] | MIT IDSS / EECS | Gene Regulation in Space and Time
Feb 24 | Geoffrey Schiebinger (http://www.stat.berkeley.edu/~geoff/) | Berkeley Stats | Sparse Inverse Problems [paper (http://arxiv.org/abs/1404.7552), paper (http://arxiv.org/abs/1506.03144)]
Mar 2 | Leonid Mirny (http://mirnylab.mit.edu/) [slides, v1 (http://www.youtube.com/watch?v=_Vc7__xfnfc), v2 (http://www.youtube.com/watch?v=stZR5s9n32s)] | MIT Physics, HST | Polymer models of chromosomes
Mar 9 | Po-Ru Loh (http://www.hsph.harvard.edu/po-ru-loh/) [slides] | HSPH, Price Lab | Haplotype phasing in large cohorts: Modeling, search, or both? [paper (http://biorxiv.org/content/early/2015/12/18/028282), code (http://www.hsph.harvard.edu/po-ru-loh/software/)]
Mar 16 | Jeremy Freeman (http://www.jeremyfreeman.net/) [video] | Janelia Research Campus, HHMI | Open source tools for large-scale neuroscience [paper (http://thefreemanlab.com/work/papers/freeman-2015-currentopinion.pdf)]
Mar 23 | D. Sculley (http://www.eecs.tufts.edu/~dsculley/) | Google Brain, Google Research Cambridge | A quick introduction to TensorFlow and related APIs [cell_paper]
Mar 30 | John Wakeley (http://wakeleylab.oeb.harvard.edu/) | Harvard OEB (Chair) | The effects of population pedigrees on gene genealogies [coalescent background (http://cseweb.ucsd.edu/classes/sp06/cse280b/notes/nordborg_coalescent.pdf)]
Apr 6 | Abraham Heifets (http://www.cs.toronto.edu/~aheifets/) | CEO Atomwise (http://www.atomwise.com/) | AtomNet: a deep convolutional neural net for bioactivity prediction in structure-based drug discovery [paper (http://arxiv.org/pdf/1510.02855v1.pdf)]
Apr 13 | Shantanu Singh, Anne Carpenter, Mohammad Rohban | Carpenter Lab, Broad Imaging Platform | Information in Cell Images: Targeting Diseases and Characterizing Compounds [paper (http://www.sciencedirect.com/science/article/pii/S0958166916301112)]
Apr 20 | Joshua Weinstein | Zhang Lab (http://zlab.mit.edu/team.html) | DNA microscopy and the sequence-to-image inverse problem
Apr 27 | Yaniv Erlich (http://teamerlich.org/) [video, slides] | Columbia CS, NY Genome Center | Compressed experiments [paper (http://biorxiv.org/content/early/2015/12/25/035352)]
Special session, 1pm - 2pm, Yellowstone | David Tse [video, slides] | Stanford EE, UC Berkeley | The Science of Information: Case Studies from DNA and RNA Assembly [paper (http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-S5-S18), paper (http://biorxiv.org/content/early/2016/02/09/039230)]
May 4 | Barbara Engelhardt (http://www.cs.princeton.edu/~bee/) [video] | Princeton CS | Bayesian structured sparsity: rethinking sparse regression [paper (http://arxiv.org/abs/1407.2235)]
May 11 | David Blei (http://www.cs.columbia.edu/~blei/) [video, intro slides] | Columbia Data Science, Columbia CS / Stats | Scaling and Generalizing Variational Inference [paper (http://biorxiv.org/content/early/2015/05/28/013227), edward]
May 18 | Matei Zaharia [video, slides] | MIT CSAIL, EECS; Co-founder, CTO Databricks | Scaling data analysis with Apache Spark [sparkRDD (http://www-bcf.usc.edu/~minlanyu/teach/csci599fall12/papers/nsdi_spark.pdf), MOOC]
May 25 | MIA breakfast social | |
Jun 1 | Su-In Lee (http://suinlee.cs.washington.edu/) | U Washington CS, Genomics, EE | Identifying molecular markers for cancer treatment from big data [paper (http://biorxiv.org/content/early/2016/04/12/048215)]

January 27, 2016
Brendan Frey (http://www.psi.toronto.edu/~frey/), Professor, University of Toronto; Fellow, Canadian Institute for Advanced Research; Cofounder, Deep Genomics (http://www.deepgenomics.com/)
Genomic Medicine: Will Software Eat Bio?
Abstract: Deep learning will transform biology and medicine, but not in the way that many advocates think. Downloading ten thousand genomes and training a neural network to predict disease won't cut it. It is overly simplistic to believe that deep learning, or machine learning in general, can successfully be applied to genome data without taking into account biological processes that connect genotype to phenotype. The amount of data multiplied by the mutation frequency divided by the biological complexity and the number of hidden variables is too small. I'll describe a rational "software meets bio" approach that has recently emerged in the research community and that is being pursued by dozens of young investigators. The approach has improved our ability to "read the genome", and I believe it will have a significant impact on genome biology and medicine. I'll discuss which applications are ripe and which are merely seductive, how we should train models to take advantage of new types of data, and how we can interpret machine learning models.

February 3, 2016
Shamil Sunyaev (http://genetics.bwh.harvard.edu/wiki/sunyaevlab/), Professor, Harvard Medical School; Research Geneticist, Brigham & Women's Hospital; Associate Member, Broad Institute
Judging the importance of human mutations using evolutionary models
Abstract: Many forces influence the fate of alleles in populations, and the detailed quantitative description of the allelic dynamics is complex. However, some applications allow for simplifications, making the evolutionary models useful in the context of human genetics. The examples include comparative genomics and the analysis of large-scale sequencing datasets.

February 10, 2016
Jeremy Gunawardena, Professor, Harvard Medical School Department of Systems Biology
Systems biology: can mathematics lead experiments?
Abstract: The -omic revolution in biology, and parallel developments in microscopy and imaging, have opened up fascinating new opportunities for analysing biological data using tools from the mathematical sciences. However, the kind of data we have and the way we interpret them are determined by the conceptual landscape through which experimentalists reason about biology. In this talk, I will consider how mathematics can help to shape that conceptual landscape and thereby suggest new experimental strategies. I will describe some of our recent work on how eukaryotic genes are regulated, which tries to update conventional thinking in this field, largely derived from bacterial studies, and I will point out how this exercise gives rise to mathematical conjectures for which we currently have no solutions.

February 17, 2016
Caroline Uhler (http://www.carolineuhler.com/), MIT EECS, IDSS
Gene Regulation in Space and Time
Abstract: Although the genetic information in each cell within an organism is identical, gene expression varies widely between different cell types. The quest to understand this phenomenon has led to many interesting mathematics problems. First, I will present a new method for learning gene regulatory networks. It overcomes the limitations of existing algorithms for learning directed graphs and is based on algebraic, geometric and combinatorial arguments.
Second, I will analyze the hypothesis that the differential gene expression is related to the spatial organization of chromosomes. I will describe a bi-level optimization formulation to find minimal overlap configurations of ellipsoids and model chromosome arrangements. Analyzing the resulting ellipsoid configurations has important implications for the reprogramming of cells during development.

February 24, 2016
Geoffrey Schiebinger (http://www.stat.berkeley.edu/~geoff/), UC Berkeley, Department of Statistics
Sparse Inverse Problems
Abstract: What can we learn by observing nature? How can we understand and predict natural phenomena? This talk is on the mathematics of precision measurement. How can we solve for the input that generated the output of some measurement apparatus? Our starting point is an information-theoretic prior of sparsity. We investigate sparse inverse problems where we assume the input can be described by a small number of parameters. We introduce some of our recent theoretical results in superresolution and in spectral clustering. In particular, we show how to solve infinite-dimensional deconvolution problems with finite-dimensional convex optimization. And we show why dimensionality reduction can be such a useful preprocessing step for mixture models.

March 2, 2016
Leonid Mirny (http://mirnylab.mit.edu/), MIT Physics, Health Sciences and Technology
Polymer models of chromosomes
Abstract: DNA of the human genome is 2m long and is folded into a structure that fits in a cell nucleus. One of the central physical questions here is the question of scales: how can microscopic processes of molecular interactions at the nanometer scale drive chromosomal organization at microns? Inferring principles of 3D organization of chromosomes from a range of biological data is a challenging biophysical problem. We develop a top-down approach to biophysical modeling of chromosomes. Starting with a minimal set of biologically motivated interactions, we build polymer models of chromosome organization that can reproduce major features observed in Hi-C and microscopy experiments. I will present our work on modeling organization of human metaphase and interphase chromosomes.

March 9, 2016
Po-Ru Loh (http://www.hsph.harvard.edu/po-ru-loh/), Harvard School of Public Health, Price Lab
Haplotype phasing in large cohorts: Modeling, search, or both?
Abstract: Inferring haploid phase from diploid genotype data -- "phasing" for short -- is a fundamental question in human genetics and a key step in genotype imputation. How should one go about phasing a large cohort? The answer depends on how large. In this talk, I will contrast two approaches to computational phasing: hidden Markov models (HMMs), which perform precise but computationally expensive statistical inference, and long-range phasing (LRP), which relies instead on rapidly searching for long genomic segments shared among samples. I will present a new LRP method (Eagle), describe its performance on N=150,000 UK Biobank samples, and discuss future directions.

March 16, 2016
Jeremy Freeman (http://www.jeremyfreeman.net/), Janelia Research Campus, HHMI
Open source tools for large-scale neuroscience
Abstract: Modern computing and the web are both enabling and changing how we do science. Using neuroscience as an example, I will highlight some of these developments, spanning a surprising diversity of technologies.
I'll discuss distributed computing for data analytics, cloud computing and containerization for reproducibility, peer-to-peer networks for sharing data and knowledge, functional reactive programming for hardware control, and WebGL for large-scale interactive experiments. And I will describe several open source projects we and others are working on across these domains. I hope to convey both what we're learning about the brain with these approaches, and how science itself is evolving in the process.

March 23, 2016
D. Sculley (http://www.eecs.tufts.edu/~dsculley/), Google Brain, Google Research Cambridge
A quick introduction to TensorFlow and related APIs
Abstract: TensorFlow was recently released to the open source world as a platform for developing cutting-edge ML models, with an emphasis on deep architectures including neural nets, convolutional neural nets, recurrent neural nets, and LSTMs. The open source version of TensorFlow now supports distributed computation across many machines, opening up a new level of scale to the research community. In this talk, we'll go over a quick introduction to the basic TensorFlow abstractions, and will also look at some higher-level APIs that offer a convenient level of abstraction for many common use cases. Folks interested in learning more are encouraged to visit tensorflow.org and the excellent Udacity course on ML featuring TensorFlow.

March 30, 2016
John Wakeley (http://wakeleylab.oeb.harvard.edu/), Harvard Organismic and Evolutionary Biology (Chair)
The effects of population pedigrees on gene genealogies
Abstract: The models of coalescent theory for diploid organisms are wrongly based on averaging over reproductive, or family, relationships. In fact, the entire set of relationships, which may be called the population pedigree, is fixed by past events. Because of this, the standard equations of population genetics for probabilities of common ancestry are incorrect. However, the predictions of coalescent models appear surprisingly accurate for many purposes. A number of different scenarios will be investigated using simulations to illustrate the effects of pedigrees on gene genealogies both within and among loci. These scenarios include selective sweeps, the occurrence of very large families, and population subdivision with migration.

April 6, 2016
Abraham Heifets (http://www.cs.toronto.edu/~aheifets/), CEO, Atomwise (http://www.atomwise.com/)
AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery
Abstract: Deep convolutional neural networks (neural nets with a constrained architecture that leverages the spatial and temporal structure of the domain they model) achieve the best predictive performance in areas such as speech and image recognition. Such neural networks autonomously discover and hierarchically compose simple local features into complex models. We demonstrate that biochemical interactions, being similarly local, are amenable to automatic discovery and modeling by similarly constrained machine learning architectures. We describe the training of AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications, on millions of training examples derived from ChEMBL and the PDB. We visualize the automatically derived convolutional filters and demonstrate that the system is discovering chemically sensible interactions.
April 6, 2016
Abraham Heifets (http://www.cs.toronto.edu/~aheifets/)
CEO, Atomwise (http://www.atomwise.com/)
AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery
Abstract: Deep convolutional neural networks (neural nets with a constrained architecture that leverages the spatial and temporal structure of the domain they model) achieve the best predictive performance in areas such as speech and image recognition. Such neural networks autonomously discover and hierarchically compose simple local features into complex models. We demonstrate that biochemical interactions, being similarly local, are amenable to automatic discovery and modeling by similarly constrained machine learning architectures. We describe the training of AtomNet, the first structure-based deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications, on millions of training examples derived from ChEMBL and the PDB. We visualize the automatically derived convolutional filters and demonstrate that the system is discovering chemically sensible interactions. Finally, we demonstrate the utility of autonomously discovered filters by outperforming previous docking approaches and achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators.

April 13, 2016
Shantanu Singh, Anne Carpenter, Mohammad Rohban
Carpenter Lab, Broad Imaging Platform
Information in Cell Images: Targeting Diseases and Characterizing Compounds
Abstract: Our lab, the Broad's Imaging Platform, aims to make perturbations in cell morphology as computable as other large-scale functional genomics data. We began by creating model-based segmentation algorithms to identify regions of interest in images (usually individual cells or compartments within them) and produced software that has become the world standard for image analysis from high-throughput microscopy experiments (CellProfiler (http://www.cellprofiler.org), cited in 3000+ scientific papers). We have taken on a new challenge: using cell images to identify signatures of genes and chemicals, with the ultimate goal of finding the cause and potential cures of diseases. High-throughput microscopy enables imaging several thousand cells per chemical or genetic perturbation, and identifying multiple organelles using fluorescent markers yields hundreds of image features per cell. We use this rich information to construct perturbation signatures, or "profiles". Our goals in these profiling experiments include identifying drug targets and mechanisms of action, determining the functional impact of disease-related alleles, creating performance-diverse chemical libraries, categorizing mechanisms of drug toxicity, and uncovering diagnostic markers for psychiatric disease. The technical challenges we encounter include dealing with cellular subpopulation heterogeneity, interpreting and visualizing statistical models, learning better representations of the data, and integrating imaging information with other data modalities.

April 20, 2016
Joshua Weinstein
Zhang Lab (http://zlab.mit.edu/team.html)
DNA microscopy and the sequence-to-image inverse problem
Abstract: Technologies that jointly resolve both gene sequences and the spatial relationships of the cells that express them are playing an increasing role in deepening our understanding of tissue biology. In this talk, I will describe an experimental technique, called DNA microscopy, which encodes the physical structure and genetic composition of a biological sample directly into a library of DNA sequences. I will then discuss and demonstrate the application of N-body optimization to the inverse problem of inferring positions from real data.
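
The flavor of that inverse problem can be conveyed with a toy example (entirely invented data, not DNA microscopy itself): recover point positions from noisy pairwise distances by gradient descent on a stress function, a basic multidimensional-scaling-style reconstruction:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "ground truth": 30 points in the plane (stand-ins for molecules in a sample).
    n = 30
    truth = rng.uniform(0, 10, size=(n, 2))

    # Simulated measurements: noisy pairwise distances (a crude stand-in for the
    # proximity information encoded in a DNA microscopy sequencing library).
    d_obs = np.linalg.norm(truth[:, None, :] - truth[None, :, :], axis=-1)
    d_obs = d_obs + rng.normal(0, 0.1, size=d_obs.shape)
    d_obs = (d_obs + d_obs.T) / 2
    np.fill_diagonal(d_obs, 0.0)

    # Gradient descent on the stress sum_ij (||x_i - x_j|| - d_ij)^2; positions are
    # recovered only up to rotation, reflection, and translation.
    x = rng.normal(size=(n, 2))
    for step in range(10000):
        diff = x[:, None, :] - x[None, :, :]                        # (n, n, 2)
        dist = np.linalg.norm(diff, axis=-1) + np.eye(n) + 1e-9     # guard division by zero
        coef = 2 * (dist - d_obs) / dist                            # diagonal terms contribute 0
        grad = (coef[:, :, None] * diff).sum(axis=1) / n
        x -= 0.02 * grad

    recon = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    off_diag = ~np.eye(n, dtype=bool)
    print("RMS error of reconstructed pairwise distances:",
          np.sqrt(np.mean((recon - d_obs)[off_diag] ** 2)))
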
9:30am, Monadnock, April 27, 2016
Yaniv Erlich (http://teamerlich.org/)
Columbia University, New York Genome Center
Compressed Experiments
Abstract: Molecular biology increasingly relies on large screens in which enormous numbers of specimens are systematically assayed in the search for a particular, rare outcome. These screens include the systematic testing of small molecules for potential drugs and testing the association between genetic variation and a phenotype of interest. While these screens are "hypothesis-free," they can be wasteful; pooling the specimens and then testing the pools is more efficient. We articulate in precise mathematical terms the types of structure useful in combinatorial pooling designs so as to eliminate waste and to provide lightweight, flexible, and modular designs. We show that Reed-Solomon codes, and more generally linear codes, satisfy all of these mathematical properties. We further demonstrate the power of this technique with Reed-Solomon-based biological experiments, and we provide general-purpose tools for experimentalists to construct and carry out practical pooling designs with rigorous guarantees for large screens.

1pm, Yellowstone, April 27, 2016
David Tse
Stanford University and U.C. Berkeley
The Science of Information: Case Studies in DNA and RNA Assembly
Abstract: Claude Shannon invented information theory in 1948 to study the fundamental limits of communication. The theory not only establishes the baseline against which all communication schemes are judged, but also inspires the design of schemes that are simultaneously information-optimal and computationally efficient. In this talk, we discuss how this point of view can be applied to the problems of de novo DNA and RNA assembly from shotgun sequencing data. We establish information limits for these problems and show how efficient assembly algorithms can be designed to attain them, despite the fact that combinatorial optimization formulations of these problems are NP-hard. We discuss Shannon, a de novo RNA-seq assembler designed on these principles, and compare its performance against state-of-the-art assemblers on several datasets.

May 4, 2016
Barbara Engelhardt (http://www.cs.princeton.edu/~bee/)
Princeton University
Bayesian structured sparsity: rethinking sparse regression
Abstract: Sparse regression has become an indispensable method for data analysis over the last 20 years. The general framework for sparse regression has a number of drawbacks that we and others address in recent methods, including robustness of model selection, issues with correlated predictors, and a test statistic that is based on the size of the effect. All of these issues arise in the context of association mapping of genetic variants to quantitative traits. This talk will discuss one approach to structured sparse regression that mitigates these problems in the context of genome-wide association mapping with quantitative traits, using a Gaussian process prior to add structure to the sparsity-inducing prior across predictors. We will also describe ongoing work on variants of this model for different analytic purposes, including neuroscience applications, identifying driver somatic mutations in cancer, and methods for causal inference in observational data with large numbers of instruments.
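
The correlated-predictor issue mentioned in the abstract is easy to reproduce with a plain, unstructured sparse regression. The toy sketch below is a generic lasso baseline on simulated data, not the structured model discussed in the talk:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)

    # Toy GWAS-like setting: n samples, p predictors, a single true effect.
    n, p = 200, 1000
    x = rng.normal(size=(n, p))
    x[:, 6] = x[:, 5] + 0.05 * rng.normal(size=n)   # predictor 6 nearly duplicates 5
    y = x[:, 5] + rng.normal(size=n)                # only predictor 5 has a real effect

    fit = Lasso(alpha=0.1).fit(x, y)
    print("coefficient on predictor 5:", fit.coef_[5])
    print("coefficient on predictor 6:", fit.coef_[6])
    print("total nonzero coefficients:", np.count_nonzero(fit.coef_))
    # The lasso tends to split the signal between the two near-duplicates or to pick
    # one of them essentially at random; structured priors aim to make this choice
    # more stable and interpretable.
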
May 11, 2016
David Blei (http://www.cs.columbia.edu/~blei/)
Columbia University
Scaling and Generalizing Variational Inference
Abstract: Latent variable models have become a key tool for the modern statistician, letting us express complex assumptions about the hidden structures that underlie our data. Latent variable models have been successfully applied in numerous fields. The central computational problem in latent variable modeling is posterior inference, the problem of approximating the conditional distribution of the latent variables given the observations. Posterior inference is central to both exploratory and predictive tasks. Approximate posterior inference algorithms have revolutionized Bayesian statistics, revealing its potential as a usable and general-purpose language for data analysis. Bayesian statistics, however, has not yet reached this potential. First, statisticians and scientists regularly encounter massive data sets, but existing approximate inference algorithms do not scale well. Second, most approximate inference algorithms are not generic; each must be adapted to the specific model at hand. In this talk I will discuss our recent research on addressing these two limitations. I will describe stochastic variational inference, an approximate inference algorithm for handling massive data sets, and demonstrate its application in genetics (http://biorxiv.org/content/early/2015/05/28/013227) to the STRUCTURE model of Pritchard et al., 2000. Then I will discuss black box variational inference, a generic algorithm for approximating the posterior that can be applied to many models with little model-specific derivation and few restrictions on their properties. I will demonstrate how we can use black box inference to develop new software tools for probabilistic modeling.

May 18, 2016
Matei Zaharia
MIT CSAIL, EECS; Co-founder and CTO, Databricks
Scaling data analysis with Apache Spark
Abstract: [we neglected to request an abstract but are confident the speaker knows something about Spark]

June 1, 2016
Su-In Lee (http://suinlee.cs.washington.edu/)
University of Washington, CS, EE and Genome Sciences
Identifying molecular markers for cancer treatment from big data
Abstract: The repertoire of drugs for patients with cancer is rapidly expanding; however, cancers that appear pathologically similar often respond differently to the same drug regimens. Methods to better match patients to specific drugs are in high demand. For example, patients over 65 with acute myeloid leukemia (AML), an aggressive blood cancer, have no better prognosis today than they did in 1980. For a growing number of diseases, there is a fair amount of data on molecular profiles from patients. The most important step necessary to realize the ultimate goal is to identify molecular markers in these data that predict treatment outcomes, such as response to each chemotherapy drug. However, due to the high dimensionality of the data (i.e., the number of variables is much greater than the number of samples), along with potential biological or experimental confounders, it is an open challenge to identify robust biomarkers that replicate across different studies. In this talk, I will present two novel machine learning algorithms to resolve these challenges. These methods learn, in an unsupervised fashion, low-dimensional features that are likely to represent important molecular events in the disease process, based on molecular profiles from multiple populations of cancer patients. These algorithms led to the identification of novel molecular markers in AML and ovarian cancer.
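
As a generic illustration of that two-step pattern (unsupervised low-dimensional features, then supervised prediction of treatment response), here is a toy baseline on simulated data; it uses plain NMF and logistic regression and is not one of the algorithms presented in the talk:

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    # Invented stand-in for an expression matrix: 150 patients x 2000 genes
    # generated from 8 latent "modules" plus noise.
    n_patients, n_genes, k = 150, 2000, 8
    w = rng.gamma(2.0, 1.0, size=(n_patients, k))
    h = rng.gamma(2.0, 0.5, size=(k, n_genes))
    expr = w @ h + rng.gamma(1.0, 0.1, size=(n_patients, n_genes))

    # Pretend drug response depends on one latent module (unknown to the model).
    response = (w[:, 0] + 0.5 * rng.normal(size=n_patients) > np.median(w[:, 0])).astype(int)

    # Step 1: unsupervised low-dimensional features from the molecular profiles.
    features = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0).fit_transform(expr)

    # Step 2: predict treatment response from the learned features.
    auc = cross_val_score(LogisticRegression(max_iter=1000), features, response,
                          cv=5, scoring="roc_auc").mean()
    print("cross-validated AUC:", round(auc, 3))
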

Copyright © 2018 Broad Institute. All rights reserved.
